JOHNSON & WICHERN 


Applied Multivariate 
Statistical Analysis 


SIXTH EDITION 




RICHARD A. JOHNSON 


University of Wisconsin—Madison 


DEAN W. WICHERN 
Texas A&M University 


PEARSON 


Prentice Hall


Upper Saddle River, New Jersey 07458 


Library of Congress Cataloging-in-Publication Data
Johnson, Richard A.

Statistical analysis/Richard A. Johnson, Dean W. Wichern. 6th ed.

p. cm.

Includes index.

ISBN 0-13-187715-1

1. Statistical Analysis


CIP Data Available


Executive Acquisitions Editor: Petra Recter
Vice President and Editorial Director, Mathematics: Christine Hoag
Project Manager: Michael Bell
Production Editor: Debbie Ryan
Senior Managing Editor: Linda Mihatov Behrens
Manufacturing Buyer: Maura Zaldivar
Associate Director of Operations: Alexis Heydt-Long
Marketing Manager: Wayne Parkins
Marketing Assistant: Jennifer de Leeuwerk
Editorial Assistant/Print Supplements Editor: Joanne Wendelken
Art Director: Jayne Conte
Director of Creative Service: Paul Belfanti
Cover Designer: Bruce Kenselaar
Art Studio: Laserwords


© 2007 Pearson Education, Inc.
Pearson Prentice Hall
Pearson Education, Inc.
Upper Saddle River, NJ 07458


All rights reserved. No part of this book may be reproduced, in any form or by any means, 
without permission in writing from the publisher. 


Pearson Prentice Hall™ is a trademark of Pearson Education, Inc. 


Printed in the United States of America 
10 9 8 7 6 5 4 3 2 1


ISBN-13: 978-0-13-187715-3
ISBN-10: 0-13-187715-1


Pearson Education LTD., London 

Pearson Education Australia PTY, Limited, Sydney 
Pearson Education Singapore, Pte. Ltd 

Pearson Education North Asia Ltd, Hong Kong 
Pearson Education Canada, Ltd., Toronto 

Pearson Educación de Mexico, S.A. de C.V.
Pearson Education—Japan, Tokyo 

Pearson Education Malaysia, Pte. Ltd 


To the memory of my mother and my father. 
R.A. J. 


To Dorothy, Michael, and Andrew. 
D. W. W. 


Contents 


PREFACE xv 
1 ASPECTS OF MULTIVARIATE ANALYSIS 1 
1.1 Introduction 1 
1.2 Applications of Multivariate Techniques 3 
1.3 The Organization of Data 5
Arrays, 5 
Descriptive Statistics, 6 
Graphical Techniques, 11 
1.4 Data Displays and Pictorial Representations 19 
Linking Multiple Two-Dimensional Scatter Plots, 20 
Graphs of Growth Curves, 24 
Stars, 26 
Chernoff Faces, 27 
1.5 Distance 30
1.6 Final Comments 37
Exercises 37 
References 47 
2 MATRIX ALGEBRA AND RANDOM VECTORS 49 
2.1 Introduction 49 
2.2 Some Basics of Matrix and Vector Algebra 49 
Vectors, 49 
Matrices, 54 
2.3 Positive Definite Matrices 60 
2.4 A Square-Root Matrix 65
2.5 Random Vectors and Matrices 66 
2.6 Mean Vectors and Covariance Matrices 68 
Partitioning the Covariance Matrix, 73 
The Mean Vector and Covariance Matrix 
for Linear Combinations of Random Variables, 75 
Partitioning the Sample Mean Vector 
and Covariance Matrix, 77 
2.7 Matrix Inequalities and Maximization 78




Supplement 2A: Vectors and Matrices: Basic Concepts 82 
Vectors, 82 
Matrices, 87 


Exercises 103 
References 110 


3 SAMPLE GEOMETRY AND RANDOM SAMPLING 111 
3.1 Introduction 111 
3.2 The Geometry of the Sample 111 
3.3 Random Samples and the Expected Values of the Sample Mean and
Covariance Matrix 119
3.4 Generalized Variance 123
Situations in which the Generalized Sample Variance Is Zero, 129
Generalized Variance Determined by |R|
and Its Geometrical Interpretation, 134
Another Generalization of Variance, 137
3.5 Sample Mean, Covariance, and Correlation
As Matrix Operations 137
3.6 Sample Values of Linear Combinations of Variables 140
Exercises 144 
References 148 
4 THE MULTIVARIATE NORMAL DISTRIBUTION 149 
4.1 Introduction 149 
4.2 The Multivariate Normal Density and Its Properties 149 
Additional Properties of the Multivariate 
Normal Distribution, 156 
4.3 Sampling from a Multivariate Normal Distribution 
and Maximum Likelihood Estimation 168 
The Multivariate Normal Likelihood, 168 
Maximum Likelihood Estimation of μ and Σ, 170
Sufficient Statistics, 173
4.4 The Sampling Distribution of X̄ and S 173
Properties of the Wishart Distribution, 174
4.5 Large-Sample Behavior of X̄ and S 175
4.6 Assessing the Assumption of Normality 177
Evaluating the Normality of the Univariate Marginal Distributions, 177
Evaluating Bivariate Normality, 182
4.7 Detecting Outliers and Cleaning Data 187
Steps for Detecting Outliers, 189
4.8 Transformations to Near Normality 192


Transforming Multivariate Observations, 195 
Exercises 200 
References 208 


5 INFERENCES ABOUT A MEAN VECTOR 210
5.1 Introduction 210
5.2 The Plausibility of μ₀ as a Value for a Normal
Population Mean 210
5.3 Hotelling’s T² and Likelihood Ratio Tests 216
General Likelihood Ratio Method, 219
5.4 Confidence Regions and Simultaneous Comparisons
of Component Means 220
Simultaneous Confidence Statements, 223
A Comparison of Simultaneous Confidence Intervals
with One-at-a-Time Intervals, 229
The Bonferroni Method of Multiple Comparisons, 232
5.5 Large Sample Inferences about a Population Mean Vector 234
5.6 Multivariate Quality Control Charts 239
Charts for Monitoring a Sample of Individual Multivariate Observations
for Stability, 241
Control Regions for Future Individual Observations, 247
Control Ellipse for Future Observations, 248
T²-Chart for Future Observations, 248
Control Charts Based on Subsample Means, 249
Control Regions for Future Subsample Observations, 251
5.7 Inferences about Mean Vectors
when Some Observations Are Missing 251
5.8 Difficulties Due to Time Dependence
in Multivariate Observations 256
Supplement 5A: Simultaneous Confidence Intervals and Ellipses
as Shadows of the p-Dimensional Ellipsoids 258
Exercises 261
References 272


6 COMPARISONS OF SEVERAL MULTIVARIATE MEANS 273
6.1 Introduction 273
6.2 Paired Comparisons and a Repeated Measures Design 273
Paired Comparisons, 273
A Repeated Measures Design for Comparing Treatments, 279
6.3 Comparing Mean Vectors from Two Populations 284
Assumptions Concerning the Structure of the Data, 284
Further Assumptions When n₁ and n₂ Are Small, 285
Simultaneous Confidence Intervals, 288
The Two-Sample Situation When Σ₁ ≠ Σ₂, 291
An Approximation to the Distribution of T² for Normal Populations
When Sample Sizes Are Not Large, 294
6.4 Comparing Several Multivariate Population Means
(One-Way MANOVA) 296
Assumptions about the Structure of the Data for One-Way MANOVA, 296




A Summary of Univariate ANOVA, 297 
Multivariate Analysis of Variance (MANOVA), 301 
6.5 Simultaneous Confidence Intervals for Treatment Effects 308 
6.6 Testing for Equality of Covariance Matrices 310 
6.7 Two-Way Multivariate Analysis of Variance 312
Univariate Two-Way Fixed-Effects Model with Interaction, 312 
Multivariate Two-Way Fixed-Effects Model with Interaction, 315 
6.8 Profile Analysis 323 
6.9 Repeated Measures Designs and Growth Curves 328 
6.10 Perspectives and a Strategy for Analyzing 
Multivariate Models 332 
Exercises 337 
References 358 


7 MULTIVARIATE LINEAR REGRESSION MODELS 360 


7.1 Introduction 360 
7.2 The Classical Linear Regression Model 360
7.3 Least Squares Estimation 364
Sum-of-Squares Decomposition, 366
Geometry of Least Squares, 367
Sampling Properties of Classical Least Squares Estimators, 369
7.4 Inferences About the Regression Model 370
Inferences Concerning the Regression Parameters, 370
Likelihood Ratio Tests for the Regression Parameters, 374
7.5 Inferences from the Estimated Regression Function 378
Estimating the Regression Function at z₀, 378
Forecasting a New Observation at z₀, 379
7.6 Model Checking and Other Aspects of Regression 381
Does the Model Fit?, 381 
Leverage and Influence, 384 
Additional Problems in Linear Regression, 384 
7.7 Multivariate Multiple Regression 387 
Likelihood Ratio Tests for Regression Parameters, 395 
Other Multivariate Test Statistics, 398 
Predictions from Multivariate Multiple Regressions, 399 
7.8 The Concept of Linear Regression 401 
Prediction of Several Variables, 406 
Partial Correlation Coefficient, 409 
7.9 Comparing the Two Formulations of the Regression Model 410 
Mean Corrected Form of the Regression Model, 410 
Relating the Formulations, 412 
7.10 Multiple Regression Models with Time Dependent Errors 413 


Supplement 7A: The Distribution of the Likelihood Ratio
for the Multivariate Multiple Regression Model 418
Exercises 420
References 428


8 PRINCIPAL COMPONENTS 430
8.1 Introduction 430
8.2 Population Principal Components 430
Principal Components Obtained from Standardized Variables, 436
Principal Components for Covariance Matrices
with Special Structures, 439
8.3 Summarizing Sample Variation by Principal Components 441
The Number of Principal Components, 444
Interpretation of the Sample Principal Components, 448
Standardizing the Sample Principal Components, 449
8.4 Graphing the Principal Components 454
8.5 Large Sample Inferences 456
Large Sample Properties of λ̂_i and ê_i, 456
Testing for the Equal Correlation Structure, 457
8.6 Monitoring Quality with Principal Components 459
Checking a Given Set of Measurements for Stability, 459
Controlling Future Values, 463
Supplement 8A: The Geometry of the Sample Principal
Component Approximation 466
The p-Dimensional Geometrical Interpretation, 468
The n-Dimensional Geometrical Interpretation, 469
Exercises 470
References 480


9 FACTOR ANALYSIS AND INFERENCE
FOR STRUCTURED COVARIANCE MATRICES 481
9.1 Introduction 481
9.2 The Orthogonal Factor Model 482
9.3 Methods of Estimation 488
The Principal Component (and Principal Factor) Method, 488 
A Modified Approach—the Principal Factor Solution, 494 
The Maximum Likelihood Method, 495 
A Large Sample Test for the Number of Common Factors, 501 
9.4 Factor Rotation 504
Oblique Rotations, 512
9.5 Factor Scores 513
The Weighted Least Squares Method, 514
The Regression Method, 516
9.6 Perspectives and a Strategy for Factor Analysis 519
Supplement 9A: Some Computational Details
for Maximum Likelihood Estimation 527
Recommended Computational Scheme, 528
Maximum Likelihood Estimators of ρ = L_z L_z′ + Ψ_z, 529
Exercises 530 
References 538 




10 CANONICAL CORRELATION ANALYSIS 539 


10.1 Introduction 539 
10.2 Canonical Variates and Canonical Correlations 539 
10.3 Interpreting the Population Canonical Variables 545 
Identifying the Canonical Variables, 545 
Canonical Correlations as Generalizations 
of Other Correlation Coefficients, 547 
The First r Canonical Variables as a Summary of Variability, 548 
A Geometrical Interpretation of the Population Canonical
Correlation Analysis 549
10.4 The Sample Canonical Variates and Sample 
Canonical Correlations 550 
10.5 Additional Sample Descriptive Measures 558 
Matrices of Errors of Approximations, 558 
Proportions of Explained Sample Variance, 561 
10.6 Large Sample Inferences 563 
Exercises 567 
References 574 


11 DISCRIMINATION AND CLASSIFICATION 575 


11.1 Introduction 575 
11.2 Separation and Classification for Two Populations 576 


11.3 Classification with Two Multivariate Normal Populations 584 
Classification of Normal Populations When Σ₁ = Σ₂ = Σ, 584
Scaling, 589
Fisher's Approach to Classification with Two Populations, 590
Is Classification a Good Idea?, 592
Classification of Normal Populations When Σ₁ ≠ Σ₂, 593

11.4 Evaluating Classification Functions 596 


11.5 Classification with Several Populations 606 
The Minimum Expected Cost of Misclassification Method, 606 
Classification with Normal Populations, 609 

11.6 Fisher's Method for Discriminating 
among Several Populations 621 
Using Fisher's Discriminants to Classify Objects, 628 

11.7 Logistic Regression and Classification 634 
Introduction, 634 
The Logit Model, 634 
Logistic Regression Analysis, 636 
Classification, 638 
Logistic Regression with Binomial Responses, 640 

11.8 Final Comments 644
Including Qualitative Variables, 644 
Classification Trees, 644 
Neural Networks, 647 
Selection of Variables, 648 


Testing for Group Differences, 648
Graphics, 649
Practical Considerations Regarding Multivariate Normality, 649
Exercises 650
References 669


12 CLUSTERING, DISTANCE METHODS, AND ORDINATION 671
12.1 Introduction 671
12.2 Similarity Measures 673
Distances and Similarity Coefficients for Pairs of Items, 673
Similarities and Association Measures
for Pairs of Variables, 677
Concluding Comments on Similarity, 678
12.3 Hierarchical Clustering Methods 680
Single Linkage, 682 
Complete Linkage, 685 
Average Linkage, 690 
Ward’s Hierarchical Clustering Method, 692 
Final Comments—Hierarchical Procedures, 695 
12.4 Nonhierarchical Clustering Methods 696
K-means Method, 696 
Final Comments—Nonhierarchical Procedures, 701 
12.5 Clustering Based on Statistical Models 703
12.6 Multidimensional Scaling 706
The Basic Algorithm, 708
12.7 Correspondence Analysis 716
Algebraic Development of Correspondence Analysis, 718
Inertia, 725
Interpretation in Two Dimensions, 726
Final Comments, 726
12.8 Biplots for Viewing Sampling Units and Variables 726
Constructing Biplots, 727
12.9 Procrustes Analysis: A Method
for Comparing Configurations 732
Constructing the Procrustes Measure of Agreement, 733
Supplement 12A: Data Mining 740
Introduction, 740
The Data Mining Process, 741
Model Assessment, 742
Exercises 747
References 755


APPENDIX 757
DATA INDEX 764
SUBJECT INDEX 767


Preface 


INTENDED AUDIENCE


This book originally grew out of our lecture notes for an “Applied Multivariate
Analysis” course offered jointly by the Statistics Department and the School of
Business at the University of Wisconsin–Madison. Applied Multivariate Statistical
Analysis, Sixth Edition, is concerned with statistical methods for describing and
analyzing multivariate data. Data analysis, while interesting with one variable,
becomes truly fascinating and challenging when several variables are involved.
Researchers in the biological, physical, and social sciences frequently collect
measurements on several variables. Modern computer packages readily provide the
numerical results to rather complex statistical analyses. We have tried to provide
readers with the supporting knowledge necessary for making proper interpretations,
selecting appropriate techniques, and understanding their strengths and
weaknesses. We hope our discussions will meet the needs of experimental scientists,
in a wide variety of subject matter areas, as a readable introduction to the
statistical analysis of multivariate observations.


LEVEL


Our aim is to present the concepts and methods of multivariate analysis at a level 
that is readily understandable by readers who have taken two or more statistics 
courses. We emphasize the applications of multivariate methods and, conse- 
quently, have attempted to make the mathematics as palatable as possible. We 
avoid the use of calculus. On the other hand, the concepts of a matrix and of
matrix manipulations are important. We do not assume the reader is familiar with
matrix algebra. Rather, we introduce matrices as they appear naturally in our
discussions, and we then show how they simplify the presentation of multivariate
models and techniques.

The introductory account of matrix algebra, in Chapter 2, highlights the 
more important matrix algebra results as they apply to multivariate analysis. The 
Chapter 2 supplement provides a summary of matrix algebra results for those 
with little or no previous exposure to the subject. This supplementary material 
helps make the book self-contained and is used to complete proofs. The proofs 
may be ignored on the first reading. In this way we hope to make the book ac- 
cessible to a wide audience. 

In our attempt to make the study of multivariate analysis appealing to a 
large audience of both practitioners and theoreticians, we have had to sacrifice
a consistency of level. Some sections are harder than others. In particular, we
have summarized a voluminous amount of material on regression in Chapter 7. 
The resulting presentation is rather succinct and difficult the first time through. 
We hope instructors will be able to compensate for the unevenness in level by ju- 
diciously choosing those sections, and subsections, appropriate for their students 
and by toning them down if necessary. 


ORGANIZATION AND APPROACH 


The methodological “tools” of multivariate analysis are contained in Chapters 5 
through 12. These chapters represent the heart of the book, but they cannot be 
assimilated without much of the material in the introductory Chapters 1 through 
4. Even those readers with a good knowledge of matrix algebra or those willing 
to accept the mathematical results on faith should, at the very least, peruse Chap- 
ter 3, “Sample Geometry,” and Chapter 4, “Multivariate Normal Distribution.” 

Our approach in the methodological chapters is to keep the discussion di- 
rect and uncluttered. Typically, we start with a formulation of the population 
models, delineate the corresponding sample results, and liberally illustrate every- 
thing with examples. The examples are of two types: those that are simple and 
whose calculations can be easily done by hand, and those that rely on real-world 
data and computer software. These will provide an opportunity to (1) duplicate 
our analyses, (2) carry out the analyses dictated by exercises, or (3) analyze the 
data using methods other than the ones we have used or suggested. 

The division of the methodological chapters (5 through 12) into three units 
allows instructors some flexibility in tailoring a course to their needs. Possible 
sequences for a one-semester (two quarter) course are indicated schematically. 

Each instructor will undoubtedly omit certain sections from some chapters 
to cover a broader collection of topics than is indicated by these two choices. 


Sequence 1: Getting Started (Chapters 1-4) -> Inference About Means (Chapters 5-7) -> Analysis of Covariance Structure (Chapters 8-10)

Sequence 2: Getting Started (Chapters 1-4) -> Classification and Grouping (Chapters 11 and 12) -> Analysis of Covariance Structure (Chapters 8-10)


For most students, we would suggest a quick pass through the first four 
chapters (concentrating primarily on the material in Chapter 1; Sections 2.1, 2.2, 
2.3, 2.5, 2.6, and 3.6; and the “assessing normality” material in Chapter 4) fol- 
lowed by a selection of methodological topics. For example, one might discuss 
the comparison of mean vectors, principal components, factor analysis, discrimi- 
nant analysis and clustering. The discussions could feature the many “worked 
out” examples included in these sections of the text. Instructors may rely on
diagrams and verbal descriptions to teach the corresponding theoretical developments.
If the students have uniformly strong mathematical backgrounds, much of
the book can successfully be covered in one term. 

We have found individual data-analysis projects useful for integrating ma- 
terial from several of the methods chapters. Here, our rather complete treatments 
of multivariate analysis of variance (MANOVA), regression analysis, factor analy- 
sis, canonical correlation, discriminant analysis, and so forth are helpful, even 
though they may not be specifically covered in lectures. 


CHANGES TO THE SIXTH EDITION 


New material. Users of the previous editions will notice several major changes 
in the sixth edition. 


• Twelve new data sets including national track records for men and women,
psychological profile scores, car body assembly measurements, cell phone
tower breakdowns, pulp and paper properties measurements, Mali family
farm data, stock price rates of return, and Concho water snake data.

• Thirty-seven new exercises and twenty revised exercises, with many of these
exercises based on the new data sets.

• Four new data-based examples and fifteen revised examples.

• Six new or expanded sections:

1. Section 6.6 Testing for Equality of Covariance Matrices

2. Section 11.7 Logistic Regression and Classification

3. Section 12.5 Clustering Based on Statistical Models

4. Expanded Section 6.3 to include “An Approximation to the Distribution
of T² for Normal Populations When Sample Sizes Are Not Large”

5. Expanded Sections 7.6 and 7.7 to include Akaike’s Information Criterion

6. Consolidated previous Sections 11.3 and 11.5 on two group discriminant
analysis into single Section 11.3


Web Site. To make the methods of multivariate analysis more prominent
in the text, we have removed the long proofs of Results 7.2, 7.4, 7.10, and 10.1
and placed them on a web site accessible through www.prenhall.com/statistics.
Click on “Multivariate Statistics” and then click on our book. In addition, all
full data sets saved as ASCII files that are used in the book are available on
the web site.


Instructors’ Solutions Manual. An Instructors’ Solutions Manual is available
on the authors’ website accessible through www.prenhall.com/statistics. For
information on additional for-sale supplements that may be used with the book or
additional titles of interest, please visit the Prentice Hall web site at
www.prenhall.com.


ACKNOWLEDGMENTS


We thank many of our colleagues who helped improve the applied aspect of the 
book by contributing their own data sets for examples and exercises. A number 
of individuals helped guide various revisions of this book, and we are grateful 
for their suggestions: Christopher Bingham, University of Minnesota; Steve Coad,
University of Michigan; Richard Kiltie, University of Florida; Sam Kotz, George
Mason University; Hira Koul, Michigan State University; Bruce McCullough,
Drexel University; Shyamal Peddada, University of Virginia; K. Sivakumar,
University of Illinois at Chicago; Eric Smith, Virginia Tech; and Stanley Wasserman,
University of Illinois at Urbana-Champaign. We also acknowledge the feedback
of the students we have taught these past 35 years in our applied multivariate 
analysis courses. Their comments and suggestions are largely responsible for the 
present iteration of this work. We would also like to give special thanks to Wai 
Kwong Cheang, Shanhong Guan, Jialiang Li and Zhiguo Xiao for their help with 
the calculations for many of the examples. 

We must thank Dianne Hall for her valuable help with the Solutions Man- 
ual, Steve Verrill for computing assistance throughout, and Alison Pollack for 
implementing a Chernoff faces program. We are indebted to Cliff Gilman for his
assistance with the multidimensional scaling examples discussed in Chapter 12. 
Jacquelyn Forer did most of the typing of the original draft manuscript, and we 
appreciate her expertise and willingness to endure cajoling of authors faced with 
publication deadlines. Finally, we would like to thank Petra Recter, Debbie Ryan, 
Michael Bell, Linda Behrens, Joanne Wendelken and the rest of the Prentice Hall 
staff for their help with this project. 

R.A. Johnson 
rich@stat.wisc.edu 


D.W. Wichern 
dwichern@tamu.edu


Chapter 1


ASPECTS OF MULTIVARIATE ANALYSIS


1.1 Introduction


Scientific inquiry is an iterative learning process. Objectives pertaining to the expla- 
nation of a social or physical phenomenon must be specified and then tested by 
gathering and analyzing data. In turn, an analysis of the data gathered by experi- 
mentation or observation will usually suggest a modified explanation of the phe- 
nomenon. Throughout this iterative learning process, variables are often added or 
deleted from the study. Thus, the complexities of most phenomena require an inves- 
tigator to collect observations on many different variables. This book is concerned 
with statistical methods designed to elicit information from these kinds of data sets. 
Because the data include simultaneous measurements on many variables, this body 
of methodology is called multivariate analysis. 

The need to understand the relationships between many variables makes multi- 
variate analysis an inherently difficult subject. Often, the human mind is over- 
whelmed by the sheer bulk of the data. Additionally, more mathematics is required 
to derive multivariate statistical techniques for making inferences than in a univari- 
ate setting. We have chosen to provide explanations based upon algebraic concepts 
and to avoid the derivations of statistical results that require the calculus of many 
variables. Our objective is to introduce several useful multivariate techniques in a 
clear manner, making heavy use of illustrative examples and a minimum of mathe- 
matics. Nonetheless, some mathematical sophistication and a desire to think quanti- 
tatively will be required. 

Most of our emphasis will be on the analysis of measurements obtained without
actively controlling or manipulating any of the variables on which the measurements
are made. Only in Chapters 6 and 7 shall we treat a few experimental
plans (designs) for generating data that prescribe the active manipulation of important
variables. Although the experimental design is ordinarily the most important
part of a scientific investigation, it is frequently impossible to control the
generation of appropriate data in certain disciplines. (This is true, for example, in
business, economics, ecology, geology, and sociology.) You should consult [6] and
[7] for detailed accounts of design principles that, fortunately, also apply to
multivariate situations.

It will become increasingly clear that many multivariate methods are based 
upon an underlying probability model known as the multivariate normal distribution.
Other methods are ad hoc in nature and are justified by logical or commonsense
arguments. Regardless of their origin, multivariate techniques must, invariably,
be implemented on a computer. Recent advances in computer technology have 
been accompanied by the development of rather sophisticated statistical software 
packages, making the implementation step easier. 

Multivariate analysis is a “mixed bag.” It is difficult to establish a classification 
scheme for multivariate techniques that is both widely accepted and indicates the 
appropriateness of the techniques. One classification distinguishes techniques de- 
signed to study interdependent relationships from those designed to study depen- 
dent relationships. Another classifies techniques according to the number of 
populations and the number of sets of variables being studied. Chapters in this text 
are divided into sections according to inference about treatment means, inference 
about covariance structure, and techniques for sorting or grouping. This should not,
however, be considered an attempt to place each method into a slot. Rather, the 
choice of methods and the types of analyses employed are largely determined by 
the objectives of the investigation. In Section 1.2, we list a smaller number of 
practical problems designed to illustrate the connection between the choice of a sta- 
tistical method and the objectives of the study. These problems, plus the examples in 
the text, should provide you with an appreciation of the applicability of multivariate 
techniques across different fields. 

The objectives of scientific investigations to which multivariate methods most 
naturally lend themselves include the following: 

1. Data reduction or structural simplification. The phenomenon being studied is 
represented as simply as possible without sacrificing valuable information. It is 
hoped that this will make interpretation easier. 

2. Sorting and grouping. Groups of “similar” objects or variables are created 
based upon measured characteristics. Alternatively, rules for classifying objects 
into well-defined groups may be required. 

3. Investigation of the dependence among variables. The nature of the relation- 
ships among variables is of interest. Are all the variables mutually independent 
or are one or more variables dependent on the others? If so, how? 

4. Prediction. Relationships between variables must be determined for the purpose
of predicting the values of one or more variables on the basis of observations
on the other variables.

5. Hypothesis construction and testing. Specific statistical hypotheses, formulated 
in terms of the parameters of multivariate populations, are tested. This may be 
done to validate assumptions or to reinforce prior convictions. 
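
To make the first objective concrete, here is a minimal sketch of data reduction. It is not taken from the text: the six-item, three-variable data array is hypothetical, the function name first_component_index is an illustrative choice, and NumPy is assumed to be available. The sketch collapses three correlated measurements into a single index (the first principal component, a technique treated in Chapter 8) and reports the share of total sample variance that the single index retains.

```python
import numpy as np

# Hypothetical data: 6 items (rows) measured on 3 variables (columns).
X = np.array([
    [2.0,  4.1, 1.0],
    [3.0,  6.2, 1.4],
    [4.0,  7.9, 2.1],
    [5.0, 10.1, 2.4],
    [6.0, 12.0, 3.1],
    [7.0, 13.8, 3.5],
])


def first_component_index(data):
    """Return one summary score per item and the share of variance retained."""
    centered = data - data.mean(axis=0)       # subtract each variable's mean
    S = np.cov(centered, rowvar=False)        # p x p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    direction = eigvecs[:, -1]                # direction of greatest variance
    scores = centered @ direction             # single index value for each item
    share = eigvals[-1] / eigvals.sum()       # proportion of total variance kept
    return scores, share


scores, share = first_component_index(X)
print("index values:", np.round(scores, 2))
print("share of total variance retained:", round(float(share), 3))
```

When the share is close to 1, little information is lost by replacing the three original variables with the one index, which is the sense in which the phenomenon is "represented as simply as possible."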


We conclude this brief overview of multivariate analysis with a quotation from
F. H. C. Marriott [19], page 89. The statement was made in a discussion of cluster
analysis, but we feel it is appropriate for a broader range of methods. You should
keep it in mind whenever you attempt or read about a data analysis. It allows one to
maintain a proper perspective and not be overwhelmed by the elegance of some of
the theory:


If the results disagree with informed opinion, do not admit a simple logical interpreta- 
tion, and do not show up clearly in a graphical presentation, they are probably wrong. 
There is no magic about numerical methods, and many ways in which they can break 
down. They are a valuable aid to the interpretation of data, not sausage machines 
automatically transforming bodies of numbers into packets of scientific fact. 


1.2 Applications of Multivariate Techniques 


The published applications of multivariate methods have increased tremendously in 
recent years. It is now difficult to cover the variety of real-world applications of 
these methods with brief discussions, as we did in earlier editions of this book.
However, in order to give some indication of the usefulness of multivariate techniques,
we offer the following short descriptions of the results of studies from several
disciplines. These descriptions are organized according to the categories of objectives
given in the previous section. Of course, many of our examples are multifaceted and 
could be placed in more than one category. 


Data reduction or simplification 


• Using data on several variables related to cancer patient responses to
radiotherapy, a simple measure of patient response to radiotherapy was constructed.
(See Exercise 1.15.)

• Track records from many nations were used to develop an index of performance
for both male and female athletes. (See [8] and [22].)

• Multispectral image data collected by a high-altitude scanner were reduced to a
form that could be viewed as images (pictures) of a shoreline in two dimensions.
(See [23].)

• Data on several variables relating to yield and protein content were used to
create an index to select parents of subsequent generations of improved bean
plants. (See [13].)

• A matrix of tactic similarities was developed from aggregate data derived from
professional mediators. From this matrix the number of dimensions by which
professional mediators judge the tactics they use in resolving disputes was
determined. (See [21].)


Sorting and grouping 


• Data on several variables related to computer use were employed to create clusters of categories of computer jobs that allow a better determination of existing (or planned) computer utilization. (See [2].)

• Measurements of several physiological variables were used to develop a screening procedure that discriminates alcoholics from nonalcoholics. (See [26].)

• Data related to responses to visual stimuli were used to develop a rule for separating people suffering from a multiple-sclerosis-caused visual pathology from those not suffering from the disease. (See Exercise 1.14.)


4 Chapter 1 Aspects of Multivariate Analysis 


• The U.S. Internal Revenue Service uses data collected from tax returns to sort taxpayers into two groups: those that will be audited and those that will not. (See [31].)


Investigation of the dependence among variables 


• Data on several variables were used to identify factors that were responsible for client success in hiring external consultants. (See [12].)

• Measurements of variables related to innovation, on the one hand, and variables related to the business environment and business organization, on the other hand, were used to discover why some firms are product innovators and some firms are not. (See [3].)

• Measurements of pulp fiber characteristics and subsequent measurements of characteristics of the paper made from them are used to examine the relations between pulp fiber properties and the resulting paper properties. The goal is to determine those fibers that lead to higher quality paper. (See [17].)

• The associations between measures of risk-taking propensity and measures of socioeconomic characteristics for top-level business executives were used to assess the relation between risk-taking behavior and performance. (See [18].)


Prediction 


• The associations between test scores, and several high school performance variables, and several college performance variables were used to develop predictors of success in college. (See [10].)

• Data on several variables related to the size distribution of sediments were used to develop rules for predicting different depositional environments. (See [7] and [20].)

• Measurements on several accounting and financial variables were used to develop a method for identifying potentially insolvent property-liability insurers. (See [28].)

• cDNA microarray experiments (gene expression data) are increasingly used to study the molecular variations among cancer tumors. A reliable classification of tumors is essential for successful diagnosis and treatment of cancer. (See [9].)


Hypotheses testing 


• Several pollution-related variables were measured to determine whether levels for a large metropolitan area were roughly constant throughout the week, or whether there was a noticeable difference between weekdays and weekends. (See Exercise 1.6.)

• Experimental data on several variables were used to see whether the nature of the instructions makes any difference in perceived risks, as quantified by test scores. (See [27].)

• Data on many variables were used to investigate the differences in structure of American occupations to determine the support for one of two competing sociological theories. (See [16] and [25].)

• Data on several variables were used to determine whether different types of firms in newly industrialized countries exhibited different patterns of innovation. (See [15].)




The preceding descriptions offer glimpses into the use of multivariate methods 
in widely diverse fields. 


1.3 The Organization of Data 


Throughout this text, we are going to be concerned with analyzing measurements 
made on several variables or characteristics. These measurements (commonly called 
data) must frequently be arranged and displayed in various ways. For example, 
graphs and tabular arrangements are important aids in data analysis. Summary num- 
bers, which quantitatively portray certain features of the data, are also necessary to 
any description. 

We now introduce the preliminary concepts underlying these first steps of data 
organization. 


Arrays 


Multivariate data arise whenever an investigator, seeking to understand a social or 
physical phenomenon, selects a number p ≥ 1 of variables or characters to record.
The values of these variables are all recorded for each distinct item, individual, or 
experimental unit. 

We will use the notation $x_{jk}$ to indicate the particular value of the kth variable
that is observed on the jth item, or trial. That is,

$$
x_{jk} = \text{measurement of the } k\text{th variable on the } j\text{th item}
$$


Consequently, n measurements on p variables can be displayed as follows: 


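$$
\begin{array}{lcccccc}
 & \text{Variable 1} & \text{Variable 2} & \cdots & \text{Variable } k & \cdots & \text{Variable } p \\
\text{Item 1:} & x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
\text{Item 2:} & x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
\text{Item } j\text{:} & x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
\text{Item } n\text{:} & x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{array}
$$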


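Or we can display these data as a rectangular array, called X, of n rows and p columns:

$$
\mathbf{X} =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}
$$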


The array X, then, contains the data consisting of all of the observations on all of 
the variables. 




Example 1.1 (A data array) A selection of four receipts from a university bookstore 
was obtained in order to investigate the nature of book sales. Each receipt provided, 
among other things, the number of books sold and the total amount of each sale. Let 
the first variable be total dollar sales and the second variable be number of books 
sold. Then we can regard the corresponding numbers on the receipts as four mea- 
surements on two variables. Suppose the data, in tabular form, are 


Variable 1 (dollar sales):      42  52  48  58
Variable 2 (number of books):    4   5   4   3


Using the notation just introduced, we have 


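$$
x_{11} = 42 \quad x_{21} = 52 \quad x_{31} = 48 \quad x_{41} = 58
$$

$$
x_{12} = 4 \quad x_{22} = 5 \quad x_{32} = 4 \quad x_{42} = 3
$$

and the data array X is

$$
\mathbf{X} =
\begin{bmatrix}
42 & 4 \\
52 & 5 \\
48 & 4 \\
58 & 3
\end{bmatrix}
$$

with four rows and two columns. ■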


Considering data in the form of arrays facilitates the exposition of the subject 
matter and allows numerical calculations to be performed in an orderly and efficient 
manner. The efficiency is twofold, as gains are attained in both (1) describing nu- 
merical calculations as operations on arrays and (2) the implementation of the cal- 
culations on computers, which now use many languages and statistical packages to 
perform array operations. We consider the manipulation of arrays of numbers in 
Chapter 2. At this point, we are concerned only with their value as devices for dis- 
playing data. 
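As a rough illustration of how such an array might be held on a computer, the sketch below stores the bookstore data of Example 1.1 as a list of rows. The helper name `entry` is hypothetical and exists only to mirror the 1-based subscripts of $x_{jk}$; statistical packages offer their own array types.

```python
# Data array X from Example 1.1: one row per receipt (item),
# one column per variable (total dollar sales, number of books sold).
X = [
    [42, 4],
    [52, 5],
    [48, 4],
    [58, 3],
]

n = len(X)     # number of items (rows)
p = len(X[0])  # number of variables (columns)


def entry(X, j, k):
    """Return x_jk using the text's 1-based item index j and variable index k."""
    return X[j - 1][k - 1]


print(n, p)            # 4 2
print(entry(X, 2, 1))  # 52, the dollar sales on the second receipt
```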


Descriptive Statistics 


A large data set is bulky, and its very mass poses a serious obstacle to any attempt to 
visually extract pertinent information. Much of the information contained in the 
data can be assessed by calculating certain summary numbers, known as descriptive 
statistics. For example, the arithmetic average, or sample mean, is a descriptive statistic
that provides a measure of location, that is, a “central value” for a set of numbers.
And the average of the squares of the distances of all of the numbers from the
mean provides a measure of the spread, or variation, in the numbers. 

We shall rely most heavily on descriptive statistics that measure location, varia- 
tion, and linear association. The formal definitions of these quantities follow. 


Let $x_{11}, x_{21}, \ldots, x_{n1}$ be n measurements on the first variable. Then the
arithmetic average of these measurements is

$$
\bar{x}_1 = \frac{1}{n} \sum_{j=1}^{n} x_{j1}
$$




If the n measurements represent a subset of the full set of measurements that
might have been observed, then $\bar{x}_1$ is also called the sample mean for the first
variable. We adopt this terminology because the bulk of this book is devoted to
procedures designed to analyze samples of measurements from larger collections.

The sample mean can be computed from the n measurements on each of the
p variables, so that, in general, there will be p sample means:

$$
\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk} \qquad k = 1, 2, \ldots, p \tag{1-1}
$$


A measure of spread is provided by the sample variance, defined for n measurements
on the first variable as

$$
s_1^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2
$$

where $\bar{x}_1$ is the sample mean of the $x_{j1}$’s. In general, for p variables, we have

$$
s_k^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2 \qquad k = 1, 2, \ldots, p \tag{1-2}
$$


Two comments are in order. First, many authors define the sample variance with a
divisor of n − 1 rather than n. Later we shall see that there are theoretical reasons
for doing this, and it is particularly appropriate if the number of measurements, n, is
small. The two versions of the sample variance will always be differentiated by
displaying the appropriate expression.
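The two divisors can be compared numerically. The following is a minimal sketch, assuming the dollar-sales values of Example 1.1 as input; the function names and the `divisor_n` flag are illustrative choices, not notation from the text.

```python
def sample_mean(values):
    """Arithmetic average, as in (1-1)."""
    return sum(values) / len(values)


def sample_variance(values, divisor_n=True):
    """Average squared deviation from the mean, as in (1-2).

    divisor_n=True divides by n (the convention of this section);
    divisor_n=False divides by n - 1 (the alternative many authors use).
    """
    n = len(values)
    mean = sample_mean(values)
    sum_sq = sum((x - mean) ** 2 for x in values)
    return sum_sq / n if divisor_n else sum_sq / (n - 1)


dollar_sales = [42, 52, 48, 58]
mean_sales = sample_mean(dollar_sales)                          # 50.0
var_n = sample_variance(dollar_sales)                           # 136 / 4 = 34.0
var_n_minus_1 = sample_variance(dollar_sales, divisor_n=False)  # 136 / 3
print(mean_sales, var_n, var_n_minus_1)
```

With only four measurements the two versions differ noticeably (34 versus about 45.3), which is why the choice of divisor matters most for small n.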

Second, although the $s^2$ notation is traditionally used to indicate the sample
variance, we shall eventually consider an array of quantities in which the sample
variances lie along the main diagonal. In this situation, it is convenient to use double
subscripts on the variances in order to indicate their positions in the array. Therefore,
we introduce the notation $s_{kk}$ to denote the same variance computed from
measurements on the kth variable, and we have the notational identities

$$
s_k^2 = s_{kk} = \frac{1}{n} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2 \qquad k = 1, 2, \ldots, p \tag{1-3}
$$


The square root of the sample variance, $\sqrt{s_{kk}}$, is known as the sample standard
deviation. This measure of variation uses the same units as the observations.

Consider n pairs of measurements on each of variables 1 and 2:

$$
\begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix},
\begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix},
\ldots,
\begin{bmatrix} x_{n1} \\ x_{n2} \end{bmatrix}
$$

That is, $x_{j1}$ and $x_{j2}$ are observed on the jth experimental item (j = 1, 2, ..., n). A
measure of linear association between the measurements of variables 1 and 2 is
provided by the sample covariance

$$
s_{12} = \frac{1}{n} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2)
$$




or the average product of the deviations from their respective means. If large values for
one variable are observed in conjunction with large values for the other variable, and
the small values also occur together, $s_{12}$ will be positive. If large values from one
variable occur with small values for the other variable, $s_{12}$ will be negative. If there is no
particular association between the values for the two variables, $s_{12}$ will be
approximately zero.

The sample covariance

$$
s_{ik} = \frac{1}{n} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k) \qquad i = 1, 2, \ldots, p, \quad k = 1, 2, \ldots, p \tag{1-4}
$$

measures the association between the ith and kth variables. We note that the covariance
reduces to the sample variance when i = k. Moreover, $s_{ik} = s_{ki}$ for all i and k.

The final descriptive statistic considered here is the sample correlation coefficient
(or Pearson’s product-moment correlation coefficient, see [14]). This measure
of the linear association between two variables does not depend on the units of
measurement. The sample correlation coefficient for the ith and kth variables is
defined as

$$
r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\,\sqrt{s_{kk}}}
= \frac{\displaystyle\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}
{\sqrt{\displaystyle\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2}\,\sqrt{\displaystyle\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}} \tag{1-5}
$$

for i = 1, 2, ..., p and k = 1, 2, ..., p. Note $r_{ik} = r_{ki}$ for all i and k.

The sample correlation coefficient is a standardized version of the sample covariance,
where the product of the square roots of the sample variances provides the
standardization. Notice that $r_{ik}$ has the same value whether n or n − 1 is chosen as
the common divisor for $s_{ii}$, $s_{kk}$, and $s_{ik}$.

The sample correlation coefficient $r_{ik}$ can also be viewed as a sample covariance.
Suppose the original values $x_{ji}$ and $x_{jk}$ are replaced by standardized values
$(x_{ji} - \bar{x}_i)/\sqrt{s_{ii}}$ and $(x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$. The standardized values are commensurable
because both sets are centered at zero and expressed in standard deviation units. The
sample correlation coefficient is just the sample covariance of the standardized observations.
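Both remarks can be checked numerically. The sketch below, using the bookstore data of Example 1.1, computes the correlation with divisor n and with divisor n − 1, and then the divisor-n covariance of the standardized values. The function names are illustrative, not from the text.

```python
import math


def mean(v):
    return sum(v) / len(v)


def covariance(u, v, divisor):
    """Sum of cross-product deviations divided by the chosen divisor."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / divisor


def correlation(u, v, divisor):
    """r as in (1-5), using one common divisor for all three quantities."""
    return covariance(u, v, divisor) / math.sqrt(
        covariance(u, u, divisor) * covariance(v, v, divisor)
    )


def standardize(v):
    """Center at the mean and scale by the divisor-n standard deviation."""
    m = mean(v)
    sd = math.sqrt(covariance(v, v, len(v)))
    return [(a - m) / sd for a in v]


sales = [42, 52, 48, 58]
books = [4, 5, 4, 3]
n = len(sales)

r_n = correlation(sales, books, n)
r_n_minus_1 = correlation(sales, books, n - 1)
cov_of_z = covariance(standardize(sales), standardize(books), n)

print(round(r_n, 2))  # -0.36
```

The common divisor cancels in the ratio, so `r_n` and `r_n_minus_1` agree up to rounding error, and `cov_of_z` reproduces the same value.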

Although the signs of the sample correlation and the sample covariance are the 
same, the correlation is ordinarily easier to interpret because its magnitude is 
bounded. To summarize, the sample correlation r has the following properties:


1. The value of r must be between —1 and +1 inclusive. 


2. Here r measures the strength of the linear association. If r = 0, this implies a 
lack of linear association between the components. Otherwise, the sign of r indi- 
cates the direction of the association: r < 0 implies a tendency for one value in 
the pair to be larger than its average when the other is smaller than its average; 
and r > 0 implies a tendency for one value of the pair to be large when the 
other value is large and also for both values to be small together. 


3. The value of $r_{ik}$ remains unchanged if the measurements of the ith variable
are changed to $y_{ji} = a x_{ji} + b$, $j = 1, 2, \ldots, n$, and the values of the kth
variable are changed to $y_{jk} = c x_{jk} + d$, $j = 1, 2, \ldots, n$, provided that the
constants a and c have the same sign.
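Property 3 can be illustrated with a small numerical check. The constants below (a = 2, b = 10, c = 3, d = −1) are arbitrary values chosen for this sketch. The last line also shows what happens when a and c have opposite signs: the magnitude of r is kept but its sign reverses.

```python
import math


def correlation(u, v):
    """Sample correlation coefficient; the common divisor cancels."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    s_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    s_uu = sum((a - mu) ** 2 for a in u)
    s_vv = sum((b - mv) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


x_i = [42, 52, 48, 58]
x_k = [4, 5, 4, 3]
r = correlation(x_i, x_k)

# Same-sign scale factors (a = 2, c = 3): r is unchanged.
y_i = [2 * x + 10 for x in x_i]
y_k = [3 * x - 1 for x in x_k]
r_same_sign = correlation(y_i, y_k)

# Opposite-sign scale factors (a = -2, c = 3): only the sign of r flips.
r_opposite_sign = correlation([-2 * x + 10 for x in x_i], y_k)

print(r, r_same_sign, r_opposite_sign)
```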




The quantities $s_{ik}$ and $r_{ik}$ do not, in general, convey all there is to know about
the association between two variables. Nonlinear associations can exist that are not
revealed by these descriptive statistics. Covariance and correlation provide measures
of linear association, or association along a line. Their values are less informative
for other kinds of association. On the other hand, these quantities can be very
sensitive to “wild” observations (“outliers”) and may indicate association when, in 
fact, little exists. In spite of these shortcomings, covariance and correlation coeffi- 
cients are routinely calculated and analyzed. They provide cogent numerical sum- 
maries of association when the data do not exhibit obvious nonlinear patterns of 
association and when wild observations are not present. 

Suspect observations must be accounted for by correcting obvious recording 
mistakes and by taking actions consistent with the identified causes. The values of 
$s_{ik}$ and $r_{ik}$ should be quoted both with and without these observations.

The sum of squares of the deviations from the mean and the sum of cross- 
product deviations are often of interest themselves. These quantities are 


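$$
w_{kk} = \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2 \qquad k = 1, 2, \ldots, p \tag{1-6}
$$

and

$$
w_{ik} = \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k) \qquad i = 1, 2, \ldots, p, \quad k = 1, 2, \ldots, p \tag{1-7}
$$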


The descriptive statistics computed from n measurements on p variables can 
also be organized into arrays. 


Arrays of Basic Descriptive Statistics 


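Sample means

$$
\bar{\mathbf{x}} =
\begin{bmatrix}
\bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p
\end{bmatrix}
$$

Sample variances and covariances

$$
\mathbf{S}_n =
\begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{bmatrix} \tag{1-8}
$$

Sample correlations

$$
\mathbf{R} =
\begin{bmatrix}
1 & r_{12} & \cdots & r_{1p} \\
r_{21} & 1 & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{bmatrix}
$$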




The sample mean array is denoted by $\bar{\mathbf{x}}$, the sample variance and covariance
array by the capital letter $\mathbf{S}_n$, and the sample correlation array by $\mathbf{R}$. The subscript n
on the array $\mathbf{S}_n$ is a mnemonic device used to remind you that n is employed as a
divisor for the elements $s_{ik}$. The size of all of the arrays is determined by the number
of variables, p.

The arrays $\mathbf{S}_n$ and $\mathbf{R}$ consist of p rows and p columns. The array $\bar{\mathbf{x}}$ is a single
column with p rows. The first subscript on an entry in arrays $\mathbf{S}_n$ and $\mathbf{R}$ indicates
the row; the second subscript indicates the column. Since $s_{ik} = s_{ki}$ and $r_{ik} = r_{ki}$
for all i and k, the entries in symmetric positions about the main northwest-southeast
diagonals in arrays $\mathbf{S}_n$ and $\mathbf{R}$ are the same, and the arrays are said to be
symmetric.


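Example 1.2 (The arrays $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$ for bivariate data) Consider the data
introduced in Example 1.1. Each receipt yields a pair of measurements, total dollar
sales and number of books sold. Find the arrays $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$.

Since there are four receipts, we have a total of four measurements (observations)
on each variable.

The sample means are

$$
\bar{x}_1 = \tfrac{1}{4} \sum_{j=1}^{4} x_{j1} = \tfrac{1}{4}(42 + 52 + 48 + 58) = 50
$$

$$
\bar{x}_2 = \tfrac{1}{4} \sum_{j=1}^{4} x_{j2} = \tfrac{1}{4}(4 + 5 + 4 + 3) = 4
$$

$$
\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \end{bmatrix} = \begin{bmatrix} 50 \\ 4 \end{bmatrix}
$$

The sample variances and covariances are

$$
s_{11} = \tfrac{1}{4} \sum_{j=1}^{4} (x_{j1} - \bar{x}_1)^2
= \tfrac{1}{4}\left((42 - 50)^2 + (52 - 50)^2 + (48 - 50)^2 + (58 - 50)^2\right) = 34
$$

$$
s_{22} = \tfrac{1}{4} \sum_{j=1}^{4} (x_{j2} - \bar{x}_2)^2
= \tfrac{1}{4}\left((4 - 4)^2 + (5 - 4)^2 + (4 - 4)^2 + (3 - 4)^2\right) = .5
$$

$$
s_{12} = \tfrac{1}{4} \sum_{j=1}^{4} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2)
= \tfrac{1}{4}\left((42 - 50)(4 - 4) + (52 - 50)(5 - 4) + (48 - 50)(4 - 4) + (58 - 50)(3 - 4)\right) = -1.5
$$

$$
s_{21} = s_{12}
$$

and

$$
\mathbf{S}_n = \begin{bmatrix} 34 & -1.5 \\ -1.5 & .5 \end{bmatrix}
$$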




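The sample correlation is

$$
r_{12} = \frac{s_{12}}{\sqrt{s_{11}}\,\sqrt{s_{22}}} = \frac{-1.5}{\sqrt{34}\,\sqrt{.5}} = -.36
$$

$$
r_{21} = r_{12}
$$

so

$$
\mathbf{R} = \begin{bmatrix} 1 & -.36 \\ -.36 & 1 \end{bmatrix}
$$

■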
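The hand calculations of this example can be reproduced with a short routine. This is a sketch only: the function `summary_arrays` is a hypothetical helper written for this illustration, it uses the divisor n throughout, and it assumes no variable is constant (otherwise the correlation is undefined).

```python
import math


def summary_arrays(X):
    """Return (xbar, S_n, R) for a data array X given as a list of n rows."""
    n, p = len(X), len(X[0])
    cols = [[row[k] for row in X] for k in range(p)]
    xbar = [sum(c) / n for c in cols]
    # s_ik: average product of deviations for variables i and k (divisor n).
    S = [
        [
            sum((cols[i][j] - xbar[i]) * (cols[k][j] - xbar[k]) for j in range(n)) / n
            for k in range(p)
        ]
        for i in range(p)
    ]
    # r_ik = s_ik / (sqrt(s_ii) * sqrt(s_kk))
    R = [[S[i][k] / math.sqrt(S[i][i] * S[k][k]) for k in range(p)] for i in range(p)]
    return xbar, S, R


X = [[42, 4], [52, 5], [48, 4], [58, 3]]
xbar, S_n, R = summary_arrays(X)

print(xbar)  # [50.0, 4.0]
print(S_n)   # [[34.0, -1.5], [-1.5, 0.5]]
print([[round(r, 2) for r in row] for row in R])
```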


Graphical Techniques 


Plots are important, but frequently neglected, aids in data analysis. Although it is impossible
to simultaneously plot all the measurements made on several variables and
study the configurations, plots of individual variables and plots of pairs of variables 
can still be very informative. Sophisticated computer programs and display equip- 
ment allow one the luxury of visually examining data in one, two, or three dimen- 
sions with relative ease. On the other hand, many valuable insights can be obtained 
from the data by constructing plots with paper and pencil. Simple, yet elegant and 
effective, methods for displaying data are available in [29]. It is good statistical prac- 
tice to plot pairs of variables and visually inspect the pattern of association. Consid- 
er, then, the following seven pairs of measurements on two variables: 


Variable 1 ($x_1$):  3  4    2  6  8   2  5
Variable 2 ($x_2$):  5  5.5  4  7  10  5  7.5

These data are plotted as seven points in two dimensions (each axis represent- 
ing a variable) in Figure 1.1. The coordinates of the points are determined by the 


paired measurements: (3, 5), (4, 5.5), ..., (5, 7.5). The resulting two-dimensional
plot is known as a scatter diagram or scatter plot.


Figure 1.1 A scatter plot and marginal dot diagrams.




Also shown in Figure 1.1 are separate plots of the observed values of variable 1 
and the observed values of variable 2, respectively. These plots are called (marginal) 
dot diagrams. They can be obtained from the original observations or by projecting 
the points in the scatter diagram onto each coordinate axis. 

The information contained in the single-variable dot diagrams can be used to
calculate the sample means $\bar{x}_1$ and $\bar{x}_2$ and the sample variances $s_{11}$ and $s_{22}$. (See
Exercise 1.1.) The scatter diagram indicates the orientation of the points, and their
coordinates can be used to calculate the sample covariance $s_{12}$. In the scatter diagram
of Figure 1.1, large values of $x_1$ occur with large values of $x_2$ and small values of $x_1$
with small values of $x_2$. Hence, $s_{12}$ will be positive.

Dot diagrams and scatter plots contain different kinds of information. The in- 
formation in the marginal dot diagrams is not sufficient for constructing the scatter 
plot. As an illustration, suppose the data preceding Figure 1.1 had been paired
differently, so that the measurements on the variables $x_1$ and $x_2$ were as follows:

Variable 1 ($x_1$):  5  4    6  2  2   8  3
Variable 2 ($x_2$):  5  5.5  4  7  10  5  7.5


(We have simply rearranged the values of variable 1.) The scatter and dot diagrams 
for the “new” data are shown in Figure 1.2. Comparing Figures 1.1 and 1.2, we find 
that the marginal dot diagrams are the same, but that the scatter diagrams are decidedly
different. In Figure 1.2, large values of $x_1$ are paired with small values of $x_2$ and
small values of $x_1$ with large values of $x_2$. Consequently, the descriptive statistics for
the individual variables $\bar{x}_1$, $\bar{x}_2$, $s_{11}$, and $s_{22}$ remain unchanged, but the sample
covariance $s_{12}$, which measures the association between pairs of variables, will now be
negative.

The different orientations of the data in Figures 1.1 and 1.2 are not discernible 
from the marginal dot diagrams alone. At the same time, the fact that the marginal 
dot diagrams are the same in the two cases is not immediately apparent from the 
scatter plots. The two types of graphical procedures complement one another; they 
are not competitors. 
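The comparison of the two pairings can also be made numerically. The sketch below uses the seven pairs plotted in Figure 1.1 and the rearranged pairs plotted in Figure 1.2; the variable and function names are illustrative.

```python
def mean(v):
    return sum(v) / len(v)


def cov_n(u, v):
    """Sample covariance with divisor n."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


x2 = [5, 5.5, 4, 7, 10, 5, 7.5]
x1_original = [3, 4, 2, 6, 8, 2, 5]    # pairing plotted in Figure 1.1
x1_rearranged = [5, 4, 6, 2, 2, 8, 3]  # pairing plotted in Figure 1.2

# The marginal dot diagram depends only on the set of values, not on the pairing.
same_marginal = sorted(x1_original) == sorted(x1_rearranged)

s12_original = cov_n(x1_original, x2)      # positive
s12_rearranged = cov_n(x1_rearranged, x2)  # negative

print(same_marginal, s12_original > 0, s12_rearranged < 0)  # True True True
```

Because the sorted values of variable 1 are identical, its mean and variance cannot change; only the covariance, which depends on the pairing, changes sign.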

The next two examples further illustrate the information that can be conveyed
by a graphic display.

Figure 1.2 Scatter plot and dot diagrams for rearranged data.




Example 1.3 (The effect of unusual observations on sample correlations) Some financial
data representing jobs and productivity for the 16 largest publishing firms
appeared in an article in Forbes magazine on April 30, 1990. The data for the pair of
variables $x_1$ = employees (jobs) and $x_2$ = profits per employee (productivity) are
graphed in Figure 1.3. We have labeled two “unusual” observations. Dun & Brad- 
street is the largest firm in terms of number of employees, but is “typical” in terms of 
profits per employee. Time Warner has a “typical” number of employees, but com- 
paratively small (negative) profits per employee. 


Figure 1.3 Profits per employee and number of employees for 16 publishing firms. (Horizontal axis: employees, in thousands; vertical axis: profits per employee, in thousands of dollars.)


The sample correlation coefficient computed from the values of $x_1$ and $x_2$ is

$$
r_{12} =
\begin{cases}
-.39 & \text{for all 16 firms} \\
-.56 & \text{for all firms but Dun \& Bradstreet} \\
-.39 & \text{for all firms but Time Warner} \\
-.50 & \text{for all firms but Dun \& Bradstreet and Time Warner}
\end{cases}
$$

It is clear that atypical observations can have a considerable effect on the sample
correlation coefficient. ■
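The Forbes data are not reproduced in the text, so the following sketch illustrates the same effect with other numbers: the seven pairs plotted in Figure 1.1, plus one invented "wild" observation at (10, 1). That extra point is purely hypothetical and is not part of any data set in this chapter.

```python
import math


def correlation(u, v):
    """Sample correlation coefficient; the common divisor cancels."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    s_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    s_uu = sum((a - mu) ** 2 for a in u)
    s_vv = sum((b - mv) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# The seven pairs of measurements plotted in Figure 1.1.
x1 = [3, 4, 2, 6, 8, 2, 5]
x2 = [5, 5.5, 4, 7, 10, 5, 7.5]
r_without = correlation(x1, x2)

# Hypothetical outlier appended for illustration only.
r_with = correlation(x1 + [10], x2 + [1])

print(round(r_without, 2), round(r_with, 2))
```

A single added point is enough here to move the correlation from strongly positive to essentially zero, which is why the text recommends quoting the statistic both with and without suspect observations.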


Example 1.4 (A scatter plot for baseball data) In a July 17, 1978, article on money in
sports, Sports Illustrated magazine provided data on $x_1$ = player payroll for National
League East baseball teams.

We have added data on $x_2$ = won-lost percentage for 1977. The results are
given in Table 1.1.

The scatter plot in Figure 1.4 supports the claim that a championship team can 
be bought. Of course, this cause-effect relationship cannot be substantiated, be- 
cause the experiment did not include a random assignment of payrolls. Thus, statis- 
tics cannot answer the question: Could the Mets have won with $4 million to spend 
on player salaries? 




Table 1.1 1977 Salary and Final Record for the National League East

Team                     x1 = player payroll    x2 = won-lost percentage
Philadelphia Phillies        3,497,900                 .623
Pittsburgh Pirates           2,485,475                 .593
St. Louis Cardinals          1,782,875                 .512
Chicago Cubs                 1,725,450                 .500
Montreal Expos               1,645,575                 .463
New York Mets                1,469,800                 .395


Figure 1.4 Salaries and won-lost percentage from Table 1.1. (Horizontal axis: player payroll in millions of dollars; vertical axis: won-lost percentage.)


To construct the scatter plot in Figure 1.4, we have regarded the six paired ob- 
servations in Table 1.1 as the coordinates of six points in two-dimensional space. The 
figure allows us to examine visually the grouping of teams with respect to the
variables total payroll and won-lost percentage. ■
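As a numerical companion to the scatter plot, the sample correlation for the six teams in Table 1.1 can be computed directly. This is only a sketch with illustrative names; like the plot, the number describes linear association and says nothing about cause and effect.

```python
import math


def correlation(u, v):
    """Sample correlation coefficient; the common divisor cancels."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    s_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    s_uu = sum((a - mu) ** 2 for a in u)
    s_vv = sum((b - mv) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# Table 1.1, in team order: Phillies, Pirates, Cardinals, Cubs, Expos, Mets.
payroll = [3497900, 2485475, 1782875, 1725450, 1645575, 1469800]
won_lost = [0.623, 0.593, 0.512, 0.500, 0.463, 0.395]

r = correlation(payroll, won_lost)
print(round(r, 2))
```

Expressing payroll in dollars or in millions of dollars gives the same r, in line with property 3 of the correlation coefficient.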


Example 1.5 (Multiple scatter plots for paper strength measurements) Paper is man- 
ufactured in continuous sheets several feet wide. Because of the orientation of fibers 
within the paper, it has a different strength when measured in the direction pro- 
duced by the machine than when measured across, or at right angles to, the machine 
direction. Table 1.2 shows the measured values of 


$x_1$ = density (grams/cubic centimeter)
$x_2$ = strength (pounds) in the machine direction
$x_3$ = strength (pounds) in the cross direction

A novel graphic presentation of these data appears in Figure 1.5. The
scatter plots are arranged as the off-diagonal elements of a covariance array and
box plots as the diagonal elements. The latter are on a different scale with this 




Table 1.2 Paper-Quality Measurements

                               Strength
Specimen   Density   Machine direction   Cross direction
    1       .801          121.41              70.42
    2       .824          127.70              72.47
    3       .841          129.20              78.20
    4       .816          131.80              74.89
    5       .840          135.10              71.21
    6       .842          131.50              78.39
    7       .820          126.70              69.02
    8       .802          115.10              73.10
    9       .828          130.80              79.28
   10       .819          124.60              76.48
   11       .826          118.31              70.25
   12       .802          114.20              72.88
   13       .810          120.30              68.23
   14       .802          115.70              68.12
   15       .832          117.51              71.62
   16       .796          109.81              53.10
   17       .759          109.10              50.85
   18       .770          115.10              51.68
   19       .759          118.31              50.60
   20       .772          112.60              53.51
   21       .806          116.20              56.53
   22       .803          118.00              70.70
   23       .845          131.00              74.35
   24       .822          125.70              68.29
   25       .971          126.10              72.10
   26       .816          125.80              70.64
   27       .836          125.50              76.33
   28       .815          127.80              76.75
   29       .822          130.50              80.33
   30       .822          127.90              75.68
   31       .843          123.90              78.54
   32       .824          124.10              71.91
   33       .788          120.80              68.22
   34       .782          107.40              54.42
   35       .795          120.70              70.41
   36       .805          121.91              73.68
   37       .836          122.31              74.93
   38       .788          110.60              53.52
   39       .772          103.51              48.93
   40       .776          110.71              53.67
   41       .758          113.80              52.42


Source: Data courtesy of SONOCO Products Company. 




Figure 1.5 Scatter plots and boxplots of paper-quality data from Table 1.2. (Panels: Density, Strength (MD), Strength (CD).)


software, so we use only the overall shape to provide information on symmetry 
and possible outliers for each individual characteristic. The scatter plots can be in- 
spected for patterns and unusual observations. In Figure 1.5, there is one unusual 
observation: the density of specimen 25. Some of the scatter plots have patterns 
suggesting that there are two separate clumps of observations. 

These scatter plot arrays are further pursued in our discussion of new software
graphics in the next section. ■
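The text identifies the unusual density of specimen 25 by eye. One simple numerical screen, offered here as an assumption of this sketch rather than as the authors' procedure, is to standardize the density values and flag any specimen more than three standard deviations from the mean. The cutoff of 3 is an arbitrary choice.

```python
import math

# Density (x1) for specimens 1..41, copied from Table 1.2.
density = [
    .801, .824, .841, .816, .840, .842, .820, .802,
    .828, .819, .826, .802, .810, .802, .832, .796,
    .759, .770, .759, .772, .806, .803, .845, .822,
    .971, .816, .836, .815, .822, .822, .843, .824,
    .788, .782, .795, .805, .836, .788, .772, .776,
    .758,
]


def standardized(values):
    """Center at the sample mean and scale by the divisor-n standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


z = standardized(density)

# Specimen numbers are 1-based; |z| > 3 is an assumed screening threshold.
flagged = [j + 1 for j, zj in enumerate(z) if abs(zj) > 3]
print(flagged)  # [25]
```

A screen like this examines one variable at a time, so it complements rather than replaces inspection of the scatter plots.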


In the general multiresponse situation, p variables are simultaneously recorded
on n items. Scatter plots should be made for pairs of important variables and, if the
task is not too great to warrant the effort, for all pairs. 

Limited as we are to a three-dimensional world, we cannot always picture an 
entire set of data. However, two further geometric representations of the data pro- 
vide an important conceptual framework for viewing multivariable statistical meth- 
ods. In cases where it is possible to capture the essence of the data in three 
dimensions, these representations can actually be graphed. 




n Points in p Dimensions (p-Dimensional Scatter Plot). Consider the natural extension
of the scatter plot to p dimensions, where the p measurements

$$
(x_{j1}, x_{j2}, \ldots, x_{jp})
$$

on the jth item represent the coordinates of a point in p-dimensional space. The
coordinate axes are taken to correspond to the variables, so that the jth point is $x_{j1}$
units along the first axis, $x_{j2}$ units along the second, ..., $x_{jp}$ units along the pth axis.
The resulting plot with n points not only will exhibit the overall pattern of variabili- 
ty, but also will show similarities (and differences) among the n items. Groupings of 
items will manifest themselves in this representation. 

The next example illustrates a three-dimensional scatter plot. 


Example 1.6 (Looking for lower-dimensional structure) A zoologist obtained measurements
on n = 25 lizards known scientifically as Cophosaurus texanus. The
weight, or mass, is given in grams while the snout-vent length (SVL) and hind limb 
span (HLS) are given in millimeters. The data are displayed in Table 1.3. 

Although there are three size measurements, we can ask whether or not most of 
the variation is primarily restricted to two dimensions or even to one dimension. 

To help answer questions regarding reduced dimensionality, we construct the 
three-dimensional scatter plot in Figure 1.6. Clearly most of the variation is scatter 
about a one-dimensional straight line. Knowing the position on a line along the 
major axes of the cloud of points would be almost as good as knowing the three
measurements Mass, SVL, and HLS.

However, this kind of analysis can be misleading if one variable has a much
larger variance than the others. Consequently, we first calculate the standardized
values, $z_{jk} = (x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$, so the variables contribute equally to the variation


Table 1.3 Lizard Size Data 
Lizard Mass 


5.526 
10.401 
9.213 
8.953 


Source: Data courtesy of Kevin E. Bonine. 




Figure 1.6 3D scatter 
plot of lizard data from 
Table 1.3. 


in the scatter plot. Figure 1.7 gives the three-dimensional scatter plot for the
standardized variables. Most of the variation can be explained by a single variable
determined by a line through the cloud of points.


Figure 1.7 3D scatter plot of standardized lizard data.

■


A three-dimensional scatter plot can often reveal group structure.


Example 1.7 (Looking for group structure in three dimensions) Referring to Example 1.6,
it is interesting to see if male and female lizards occupy different parts of the
three-dimensional space containing the size data. The gender, by row, for the lizard
data in Table 1.3 are
f m f f m f m f m f m f m
m m m f m m m f f m f f




Figure 1.8 repeats the scatter plot for the original variables but with males 
marked by solid circles and females by open circles. Clearly, males are typically larg- 
er than females. 


Figure 1.8 3D scatter plot of male and female lizards. ■


p Points in n Dimensions. The n observations of the p variables can also be regarded
as p points in n-dimensional space. Each column of X determines one of the
points. The ith column,

$$
\begin{bmatrix}
x_{1i} \\ x_{2i} \\ \vdots \\ x_{ni}
\end{bmatrix}
$$

consisting of all measurements on the ith variable, determines the ith point.
In Chapter 3, we show how the closeness of points in n dimensions can be relat- 
ed to measures of association between the corresponding variables. 
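In computational terms, moving from the first view to the second amounts to reading the data array by columns instead of by rows. A minimal sketch, using the bookstore data of Example 1.1 with an illustrative variable name:

```python
# n = 4 items (rows), p = 2 variables (columns), from Example 1.1.
X = [[42, 4], [52, 5], [48, 4], [58, 3]]

# Each column of X becomes one point with n coordinates.
variable_points = [list(col) for col in zip(*X)]

print(variable_points)  # [[42, 52, 48, 58], [4, 5, 4, 3]]
```

Here the two variables appear as two points in four-dimensional space, one coordinate per receipt.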


1.4 Data Displays and Pictorial Representations 


The rapid development of powerful personal computers and workstations has led to 
a proliferation of sophisticated statistical software for data analysis and graphics. It 
is often possible, for example, to sit at one’s desk and examine the nature of multidimensional
data with clever computer-generated pictures. These pictures are valuable
aids in understanding data and often prevent many false starts and subsequent
inferential problems. 

As we shall see in Chapters 8 and 12, there are several techniques that seek to 
represent p-dimensional observations in few dimensions such that the original dis- 
tances (or similarities) between pairs of observations are (nearly) preserved. In gen- 
eral, if multidimensional observations can be represented in two dimensions, then 
outliers, relationships, and distinguishable groupings can often be discerned by eye. 
We shall discuss and illustrate several methods for displaying multivariate data in 
two dimensions. One good source for more discussion of graphical methods is [11]. 




Linking Multiple Two-Dimensional Scatter Plots 


One of the more exciting new graphical procedures involves electronically connect- 
ing many two-dimensional scatter plots. 


Example 1.8 (Linked scatter plots and brushing) To illustrate linked two-dimensional
scatter plots, we refer to the paper-quality data in Table 1.2. These data represent
measurements on the variables $x_1$ = density, $x_2$ = strength in the machine direction,
and $x_3$ = strength in the cross direction. Figure 1.9 shows two-dimensional scatter
plots for pairs of these variables organized as a 3 × 3 array. For example, the picture
in the upper left-hand corner of the figure is a scatter plot of the pairs of observations
$(x_1, x_3)$. That is, the $x_1$ values are plotted along the horizontal axis, and the $x_3$ values
are plotted along the vertical axis. The lower right-hand corner of the figure contains a
scatter plot of the observations $(x_3, x_1)$. That is, the axes are reversed. Corresponding
interpretations hold for the other scatter plots in the figure. Notice that the variables
and their three-digit ranges are indicated in the boxes along the SW-NE diagonal. The
operation of marking (selecting) the obvious outlier in the $(x_1, x_3)$ scatter plot of
Figure 1.9 creates Figure 1.10(a), where the outlier is labeled as specimen 25 and the
same data point is highlighted in all the scatter plots. Specimen 25 also appears to be
an outlier in the $(x_1, x_2)$ scatter plot but not in the $(x_2, x_3)$ scatter plot. The operation
of deleting this specimen leads to the modified scatter plots of Figure 1.10(b).

From Figure 1.10, we notice that some points in, for example, the (x2, x3) scatter 
plot seem to be disconnected from the others. Selecting these points, using the 
(dashed) rectangle (see page 22), highlights the selected points in all of the other 
scatter plots and leads to the display in Figure 1.11(a). Further checking revealed 
that specimens 16-21, specimen 34, and specimens 38-41 were actually specimens 


Figure 1.9 Scatter plots for the paper-quality data of Table 1.2.


Figure 1.10 Modified scatter plots for the paper-quality data with outlier (25) (a) selected and (b) deleted.


Figure 1.11 Modified scatter plots with (a) group of points selected and (b) points, including specimen 25, deleted and the scatter plots rescaled.


from an older roll of paper that was included in order to have enough plies in the 
cardboard being manufactured. Deleting the outlier and the cases corresponding to 
the older paper and adjusting the ranges of the remaining observations leads to the 
scatter plots in Figure 1.11(b). 

The operation of highlighting points corresponding to a selected range of one of 
the variables is called brushing. Brushing could begin with a rectangle, as in Figure 
1.11(a), but then the brush could be moved to provide a sequence of highlighted 
points. The process can be stopped at any time to provide a snapshot of the current 
situation.
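The selection step behind brushing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `brush`, the dictionary layout, and the four specimens are hypothetical and are not the Table 1.2 values. The sketch returns the indices of the points whose chosen variable falls inside the brush; a plotting program would then highlight those same indices in every linked scatter plot.

```python
def brush(rows, variable, low, high):
    """Return the indices of the rows whose `variable` value lies in [low, high]."""
    return [i for i, row in enumerate(rows) if low <= row[variable] <= high]


# Hypothetical measurements (not the Table 1.2 values): one dict per specimen.
specimens = [
    {"density": 0.80, "machine": 121.4, "cross": 70.4},
    {"density": 0.82, "machine": 127.7, "cross": 72.5},
    {"density": 0.84, "machine": 110.6, "cross": 53.5},
    {"density": 0.76, "machine": 103.5, "cross": 48.9},
]

# Brush on the cross-direction strength; the same indices would be
# highlighted in all of the linked scatter plots.
highlighted = brush(specimens, "cross", 45.0, 55.0)
print(highlighted)  # [2, 3]
```

Moving the brush corresponds to calling the function again with a shifted range, which yields the sequence of highlighted points described above.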


Scatter plots like those in Example 1.8 are extremely useful aids in data analy- 
sis. Another important new graphical technique uses software that allows the data 
analyst to view high-dimensional data as slices of various three-dimensional per- 
spectives. This can be done dynamically and continuously until informative views 
are obtained. A comprehensive discussion of dynamic graphical methods is avail- 
able in [1]. A strategy for on-line multivariate exploratory graphical analysis, moti- 
vated by the need for a routine procedure for searching for structure in multivariate 
data, is given in [32]. 


Example 1.9 (Rotated plots in three dimensions) Four different measurements of 
lumber stiffness are given in Table 4.3, page 186. In Example 4.14, specimen (board) 
16 and possibly specimen (board) 9 are identified as unusual observations. Figures
1.12(a), (b), and (c) contain perspectives of the stiffness data in the x1, x2, x3
space. These views were obtained by continually rotating and turning the three- 
dimensional coordinate axes. Spinning the coordinate axes allows one to get a better 


Figure 1.12 Three-dimensional perspectives for the lumber stiffness data. (a) Outliers clear. (c) Specimen 9 large. (d) Good view of x2, x3, x4 space.


understanding of the three-dimensional aspects of the data. Figure 1.12(d) gives 
one picture of the stiffness data in x2, x3, x4 space. Notice that Figures 1.12(a) and 
(d) visually confirm specimens 9 and 16 as outliers. Specimen 9 is very large in all 
three coordinates. A counterclockwiselike rotation of the axes in Figure 1.12(a) 
produces Figure 1.12(b), and the two unusual observations are masked in this view. 
A further spinning of the x2, x3 axes gives Figure 1.12(c); one of the outliers (16) is 
now hidden. 

Additional insights can sometimes be gleaned from visual inspection of the 
slowly spinning data. It is this dynamic aspect that statisticians are just beginning to 
understand and exploit.


Plots like those in Figure 1.12 allow one to identify readily observations that do 
not conform to the rest of the data and that may heavily influence inferences based 
on standard data-generating models. 


Graphs of Growth Curves 


When the height of a young child is measured at each birthday, the points can be 
plotted and then connected by lines to produce a graph. This is an example of a 
growth curve. In general, repeated measurements of the same characteristic on the 
same unit or subject can give rise to a growth curve if an increasing, decreasing, or 
even an increasing followed by a decreasing, pattern is expected. 


Example 1.10 (Arrays of growth curves) The Alaska Fish and Game Department 
monitors grizzly bears with the goal of maintaining a healthy population. Bears are 
shot with a dart to induce sleep and weighed on a scale hanging from a tripod. Mea- 
surements of length are taken with a steel tape. Table 1.4 gives the weights (wt) in 
kilograms and lengths (lngth) in centimeters of seven female bears at 2, 3, 4, and 5
years of age.

First, for each bear, we plot the weights versus the ages and then connect the
weights at successive years by straight lines. This gives an approximation to the growth
curve for weight. Figure 1.13 shows the growth curves for all seven bears. The noticeable
exception to a common pattern is the curve for bear 5. Is this an outlier or just
natural variation in the population? In the field, bears are weighed on a scale that


Table 1.4 Female Bear Data

Bear   Wt2   Wt3   Wt4   Wt5   Lngth2   Lngth3   Lngth4   Lngth5
1       48    59    95    82      141      157      168      183
2       59    68   102   102      140      168      174      170
3       61    77    93   107      145      162      172      177
4       54    43   104   104      146      159      176      171
5      100   145   185   247      150      158      168      175
6       68    82    95   118      142      140      178      189
7       68    95   109   111      139      171      176      175

Source: Data courtesy of H. Roberts.


Figure 1.13 Combined growth curves for weight for seven female grizzly bears.


reads pounds. Further inspection revealed that, in this case, an assistant later failed to 
convert the field readings to kilograms when creating the electronic database. The 
correct weights are (45, 66, 84, 112) kilograms.
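The correction for bear 5 can be checked with a short conversion, sketched here in Python. The function name is hypothetical, and rounding to the nearest whole kilogram is an assumption made so that the results can be compared with the corrected weights quoted above.

```python
LB_TO_KG = 0.45359237  # kilograms per pound (international avoirdupois pound)


def pounds_to_kilograms(pounds):
    """Convert pounds to kilograms, rounded to the nearest whole kilogram (assumed)."""
    return round(pounds * LB_TO_KG)


# Table 1.4 entries for bear 5, which were really field readings in pounds.
recorded_bear5 = [100, 145, 185, 247]
corrected = [pounds_to_kilograms(w) for w in recorded_bear5]
print(corrected)  # [45, 66, 84, 112]
```

The converted values agree with the corrected weights, which supports the explanation that the unusual curve for bear 5 came from a units error rather than from natural variation.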

Because it can be difficult to inspect visually the individual growth curves in a 
combined plot, the individual curves should be replotted in an array where similari- 
ties and differences are easily observed. Figure 1.14 gives the array of seven curves 
for weight. Some growth curves look linear and others quadratic. 


Figure 1.14 Individual growth curves for weight for female grizzly bears.


Figure 1.15 gives a growth curve array for length. One bear seemed to get shorter 
from 2 to 3 years old, but the researcher knows that the steel tape measurement of
length can be thrown off by the bear’s posture when sedated. 


Figure 1.15 Individual growth curves for length for female grizzly bears.


We now turn to two popular pictorial representations of multivariate data in 
two dimensions: stars and Chernoff faces. 


Stars 


Suppose each data unit consists of nonnegative observations on p ≥ 2 variables. In
two dimensions, we can construct circles of a fixed (reference) radius with p equally
spaced rays emanating from the center of the circle. The lengths of the rays represent
the values of the variables. The ends of the rays can be connected with straight lines to
form a star. Each star represents a multivariate observation, and the stars can be
grouped according to their (subjective) similarities.

It is often helpful, when constructing the stars, to standardize the observations.
In this case some of the observations will be negative. The observations can then be
reexpressed so that the center of the circle represents the smallest standardized
observation within the entire data set.


Example 1.11 (Utility data as stars) Stars representing the first 5 of the 22 public 
utility firms in Table 12.4, page 688, are shown in Figure 1.16. There are eight variables;
consequently, the stars are distorted octagons.


Figure 1.16 Stars for the first five public utilities: Arizona Public Service (1), Boston Edison Co. (2), Central Louisiana Electric Co. (3), Commonwealth Edison Co. (4), and Consolidated Edison Co. (NY) (5).


The observations on all variables were standardized. Among the first five utili- 
ties, the smallest standardized observation for any variable was —1.6. Treating this 
value as zero, the variables are plotted on identical scales along eight equiangular 
rays originating from the center of the circle. The variables are ordered in a clockwise
direction, beginning in the 12 o’clock position.
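The construction just described can be sketched in Python as follows. The function name and the four sample values are hypothetical; the sketch assumes only what the text states: standardized values are shifted so that the smallest one maps to the center, and the rays are equally spaced, starting at 12 o'clock and proceeding clockwise.

```python
import math


def star_vertices(standardized, smallest):
    """Return the (x, y) endpoints of the rays for one star.

    standardized: standardized values of the p variables for one observation
    smallest: smallest standardized value in the data set (plotted at the center)
    """
    p = len(standardized)
    vertices = []
    for k, z in enumerate(standardized):
        length = z - smallest                      # the minimum maps to length 0
        angle = math.pi / 2 - 2 * math.pi * k / p  # 12 o'clock first, then clockwise
        vertices.append((length * math.cos(angle), length * math.sin(angle)))
    return vertices


# Hypothetical standardized values for one observation on p = 4 variables.
for x, y in star_vertices([-1.6, 0.4, 1.4, -0.6], smallest=-1.6):
    print(round(x, 3), round(y, 3))
```

Connecting consecutive endpoints with straight lines, and closing the polygon, gives the star for that observation.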

At first glance, none of these utilities appears to be similar to any other. However, 
because of the way the stars are constructed, each variable gets equal weight in the visual
impression. If we concentrate on the variables 6 (sales in kilowatt-hour [kWh] use
per year) and 8 (total fuel costs in cents per kWh), then Boston Edison and Consolidated
Edison are similar (small variable 6, large variable 8), and Arizona Public Service,
Central Louisiana Electric, and Commonwealth Edison are similar (moderate
variable 6, moderate variable 8).


Chernoff Faces 


People react to faces. Chernoff [4] suggested representing p-dimensional observa- 
tions as a two-dimensional face whose characteristics (face shape, mouth curvature, 
nose length, eye size, pupil position, and so forth) are determined by the measure- 
ments on the p variables. 




As originally designed, Chernoff faces can handle up to 18 variables. The assign- 
ment of variables to facial features is done by the experimenter, and different choic- 
es produce different results. Some iteration is usually necessary before satisfactory 
representations are achieved. 

Chernoff faces appear to be most useful for verifying (1) an initial grouping sug- 
gested by subject-matter knowledge and intuition or (2) final groupings produced 
by clustering algorithms. 


Example 1.12 (Utility data as Chernoff faces) From the data in Table 12.4, the 22
public utility companies were represented as Chernoff faces. We have the following
correspondences:

Variable                                    Facial characteristic
X1: Fixed-charge coverage                ↔  Half-height of face
X2: Rate of return on capital            ↔  Face width
X3: Cost per kW capacity in place        ↔  Position of center of mouth
X4: Annual load factor                   ↔  Slant of eyes
X5: Peak kWh demand growth from 1974     ↔  Eccentricity (height/width) of eyes
X6: Sales (kWh use per year)             ↔  Half-length of eye
X7: Percent nuclear                      ↔  Curvature of mouth
X8: Total fuel costs (cents per kWh)     ↔  Length of nose


The Chernoff faces are shown in Figure 1.17. We have subjectively grouped
“similar” faces into seven clusters. If a smaller number of clusters is desired, we
might combine clusters 5, 6, and 7 and, perhaps, clusters 2 and 3 to obtain four or five
clusters. For our assignment of variables to facial features, the firms group largely
according to geographical location.


Constructing Chernoff faces is a task that must be done with the aid of a com- 
puter. The data are ordinarily standardized within the computer program as part of 
the process for determining the locations, sizes, and orientations of the facial char- 
acteristics. With some training, we can use Chernoff faces to communicate similari- 
ties or dissimilarities, as the next example indicates. 


Example 1.13 (Using Chernoff faces to show changes over time) Figure 1.18 illus- 
trates an additional use of Chernoff faces. (See [24].) In the figure, the faces are used 
to track the financial well-being of a company over time. As indicated, each facial 
feature represents a single financial indicator, and the longitudinal changes in these 
indicators are thus evident at a glance.


Figure 1.17 Chernoff faces for 22 public utilities.

Figure 1.18 Chernoff faces over time.


Chernoff faces have also been used to display differences in multivariate obser- 
vations in two dimensions. For example, the two-dimensional coordinate axes might 
represent latitude and longitude (geographical location), and the faces might represent
multivariate measurements on several U.S. cities. Additional examples of this
kind are discussed in [30]. 

There are several ingenious ways to picture multivariate data in two dimensions. 
We have described some of them. Further advances are possible and will almost 
certainly take advantage of improved computer graphics. 


1.5 Distance 


Although they may at first appear formidable, most multivariate techniques are based 
upon the simple concept of distance. Straight-line, or Euclidean, distance should be 
familiar. If we consider the point P = (x1, x2) in the plane, the straight-line distance,
d(O, P), from P to the origin O = (0, 0) is, according to the Pythagorean theorem,

    d(O, P) = \sqrt{x_1^2 + x_2^2}    (1-9)

The situation is illustrated in Figure 1.19. In general, if the point P has p coordinates
so that P = (x1, x2, ..., xp), the straight-line distance from P to the origin
O = (0, 0, ..., 0) is

    d(O, P) = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}    (1-10)

(See Chapter 2.) All points (x1, x2, ..., xp) that lie a constant squared distance, such
as c², from the origin satisfy the equation

    d^2(O, P) = x_1^2 + x_2^2 + \cdots + x_p^2 = c^2    (1-11)

Because this is the equation of a hypersphere (a circle if p = 2), points equidistant
from the origin lie on a hypersphere.

The straight-line distance between two arbitrary points P and Q with coordinates
P = (x1, x2, ..., xp) and Q = (y1, y2, ..., yp) is given by

    d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}    (1-12)
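As a small illustration, formula (1-12) can be written directly in Python. The function name is hypothetical, and the point (3, 4) is chosen only because its distance from the origin is easy to verify by hand.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between points p and q, as in (1-12)."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


origin = (0.0, 0.0)
print(euclidean_distance((3.0, 4.0), origin))  # 5.0
```

Setting q to the origin reduces the computation to (1-10), and every coordinate enters the sum with the same weight.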


Straight-line, or Euclidean, distance is unsatisfactory for most statistical purpos- 
es. This is because each coordinate contributes equally to the calculation of Euclid- 
ean distance. When the coordinates represent measurements that are subject to 
random fluctuations of differing magnitudes, it is often desirable to weight coordi- 
nates subject to a great deal of variability less heavily than those that are not highly 
variable. This suggests a different measure of distance. 

Our purpose now is to develop a “statistical” distance that accounts for differ- 
ences in variation and, in due course, the presence of correlation. Because our 


Figure 1.19 Distance given by the Pythagorean theorem.


choice will depend upon the sample variances and covariances, at this point we use 
the term statistical distance to distinguish it from ordinary Euclidean distance. It is 
statistical distance that is fundamental to multivariate analysis.

To begin, we take as fixed the set of observations graphed as the p-dimensional 
scatter plot. From these, we shall construct a measure of distance from the origin to 
a point P = (x1, x2, ..., xp). In our arguments, the coordinates (x1, x2, ..., xp) of P
can vary to produce different locations for the point. The data that determine dis- 
tance will, however, remain fixed. 

To illustrate, suppose we have n pairs of measurements on two variables each
having mean zero. Call the variables x1 and x2, and assume that the x1 measurements
vary independently of the x2 measurements.¹ In addition, assume that the variability
in the x1 measurements is larger than the variability in the x2 measurements. A scatter
plot of the data would look something like the one pictured in Figure 1.20.


Figure 1.20 A scatter plot with
greater variability in the x1 direction
than in the x2 direction.


Glancing at Figure 1.20, we see that values which are a given deviation from the
origin in the x1 direction are not as “surprising” or “unusual” as are values equidistant
from the origin in the x2 direction. This is because the inherent variability in the
x1 direction is greater than the variability in the x2 direction. Consequently, large x1
coordinates (in absolute value) are not as unexpected as large x2 coordinates. It
seems reasonable, then, to weight an x2 coordinate more heavily than an x1 coordinate
of the same value when computing the “distance” to the origin.

One way to proceed is to divide each coordinate by the sample standard deviation.
Therefore, upon division by the standard deviations, we have the “standardized”
coordinates x1* = x1/√s11 and x2* = x2/√s22. The standardized coordinates
are now on an equal footing with one another. After taking the differences in variability
into account, we determine distance using the standard Euclidean formula.

Thus, a statistical distance of the point P = (x1, x2) from the origin O = (0, 0) can
be computed from its standardized coordinates x1* = x1/√s11 and x2* = x2/√s22 as

    d(O, P) = \sqrt{(x_1^*)^2 + (x_2^*)^2}
            = \sqrt{\left(\frac{x_1}{\sqrt{s_{11}}}\right)^2 + \left(\frac{x_2}{\sqrt{s_{22}}}\right)^2}
            = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}}    (1-13)


¹At this point, “independently” means that the x2 measurements cannot be predicted with any
accuracy from the x1 measurements, and vice versa.




Comparing (1-13) with (1-9), we see that the difference between the two expressions
is due to the weights k1 = 1/s11 and k2 = 1/s22 attached to x1² and x2² in (1-13).
Note that if the sample variances are the same, k1 = k2, then x1² and x2² will receive
the same weight. In cases where the weights are the same, it is convenient to ignore the
common divisor and use the usual Euclidean distance formula. In other words, if
the variability in the x1 direction is the same as the variability in the x2 direction,
and the x1 values vary independently of the x2 values, Euclidean distance is
appropriate.

Using (1-13), we see that all points which have coordinates (x1, x2) and are a
constant squared distance c² from the origin must satisfy

    \frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2    (1-14)


Equation (1-14) is the equation of an ellipse centered at the origin whose major and 
minor axes coincide with the coordinate axes. That is, the statistical distance in 
(1-13) has an ellipse as the locus of all points a constant distance from the origin. 
This general case is shown in Figure 1.21. 


Figure 1.21 The ellipse of constant
statistical distance
d²(O, P) = x1²/s11 + x2²/s22 = c².


Example 1.14 (Calculating a statistical distance) A set of paired measurements
(x1, x2) on two variables yields x̄1 = x̄2 = 0, s11 = 4, and s22 = 1. Suppose the x1
measurements are unrelated to the x2 measurements; that is, measurements within a
pair vary independently of one another. Since the sample variances are unequal, we
measure the square of the distance of an arbitrary point P = (x1, x2) to the origin
O = (0, 0) by

    d^2(O, P) = \frac{x_1^2}{4} + \frac{x_2^2}{1}

All points (x1, x2) that are a constant distance 1 from the origin satisfy the equation

    \frac{x_1^2}{4} + \frac{x_2^2}{1} = 1

The coordinates of some points a unit distance from the origin are presented in the
following table:

Coordinates: (x1, x2)    Distance: x1²/4 + x2²/1 = 1
(0, 1)                   0²/4 + 1²/1 = 1
(0, -1)                  0²/4 + (-1)²/1 = 1
(2, 0)                   2²/4 + 0²/1 = 1
(1, √3/2)                1²/4 + (√3/2)²/1 = 1

A plot of the equation x1²/4 + x2²/1 = 1 is an ellipse centered at (0, 0) whose
major axis lies along the x1 coordinate axis and whose minor axis lies along the x2
coordinate axis. The half-lengths of these major and minor axes are √4 = 2 and
√1 = 1, respectively. The ellipse of unit distance is plotted in Figure 1.22. All points
on the ellipse are regarded as being the same statistical distance from the origin; in
this case, a distance of 1.
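The table entries of Example 1.14 can be checked numerically with a short Python sketch of (1-13). The function name is hypothetical; the variances s11 = 4 and s22 = 1 and the four points are those of the example.

```python
import math


def statistical_distance_from_origin(x1, x2, s11, s22):
    """Distance (1-13) for two independently varying coordinates."""
    return math.sqrt(x1 ** 2 / s11 + x2 ** 2 / s22)


points = [(0, 1), (0, -1), (2, 0), (1, math.sqrt(3) / 2)]
for x1, x2 in points:
    d = statistical_distance_from_origin(x1, x2, s11=4, s22=1)
    print((x1, x2), round(d, 6))
```

Each point gives a statistical distance of 1 (up to floating-point rounding), even though the Euclidean distances of the four points from the origin differ; the point (2, 0), for instance, is at Euclidean distance 2.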


Figure 1.22 Ellipse of unit distance, x1²/4 + x2²/1 = 1.


The expression in (1-13) can be generalized to accommodate the calculation of
statistical distance from an arbitrary point P = (x1, x2) to any fixed point
Q = (y1, y2). If we assume that the coordinate variables vary independently of one
another, the distance from P to Q is given by

    d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}}}    (1-15)

The extension of this statistical distance to more than two dimensions is
straightforward. Let the points P and Q have p coordinates such that
P = (x1, x2, ..., xp) and Q = (y1, y2, ..., yp). Suppose Q is a fixed point [it may be
the origin O = (0, 0, ..., 0)] and the coordinate variables vary independently of one
another. Let s11, s22, ..., spp be sample variances constructed from n measurements
on x1, x2, ..., xp, respectively. Then the statistical distance from P to Q is

    d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}}    (1-16)


All points P that are a constant squared distance from Q lie on a hyperellipsoid
centered at Q whose major and minor axes are parallel to the coordinate axes. We
note the following:

1. The distance of P to the origin O is obtained by setting y1 = y2 = ... = yp = 0
in (1-16).
2. If s11 = s22 = ... = spp, the Euclidean distance formula in (1-12) is appropriate.
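The two remarks above can be illustrated with a minimal Python sketch of (1-16). The function name, the point, and the variances are hypothetical. The last lines check the second remark: with a common variance, the statistical distance is the Euclidean distance divided by the common standard deviation, so the two distances order points identically.

```python
import math


def statistical_distance(p, q, variances):
    """Distance (1-16): squared coordinate differences weighted by 1/s_ii."""
    if not (len(p) == len(q) == len(variances)):
        raise ValueError("p, q, and variances must have the same length")
    return math.sqrt(sum((x - y) ** 2 / s for x, y, s in zip(p, q, variances)))


# Hypothetical point and sample variances, for illustration only.
point = (2.0, 1.0, 3.0)
origin = (0.0, 0.0, 0.0)          # remark 1: set every y_i to 0
print(statistical_distance(point, origin, (4.0, 1.0, 9.0)))

# Remark 2: with equal variances the result is a rescaled Euclidean distance.
common = 4.0
d_equal = statistical_distance(point, origin, (common, common, common))
d_euclid = math.sqrt(sum(x ** 2 for x in point))
print(math.isclose(d_equal, d_euclid / math.sqrt(common)))
```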


The distance in (1-16) still does not include most of the important cases we shall
encounter, because of the assumption of independent coordinates. The scatter plot
in Figure 1.23 depicts a two-dimensional situation in which the x1 measurements do
not vary independently of the x2 measurements. In fact, the coordinates of the pairs
(x1, x2) exhibit a tendency to be large or small together, and the sample correlation
coefficient is positive. Moreover, the variability in the x2 direction is larger than the
variability in the x1 direction.

What is a meaningful measure of distance when the variability in the x1 direction
is different from the variability in the x2 direction and the variables x1 and x2
are correlated? Actually, we can use what we have already introduced, provided that
we look at things in the right way. From Figure 1.23, we see that if we rotate the original
coordinate system through the angle θ while keeping the scatter fixed and label
the rotated axes x̃1 and x̃2, the scatter in terms of the new axes looks very much like
that in Figure 1.20. (You may wish to turn the book to place the x̃1 and x̃2 axes in
their customary positions.) This suggests that we calculate the sample variances
using the x̃1 and x̃2 coordinates and measure distance as in Equation (1-13). That is,
with reference to the x̃1 and x̃2 axes, we define the distance from the point
P = (x̃1, x̃2) to the origin O = (0, 0) as

    d(O, P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}}    (1-17)

where s̃11 and s̃22 denote the sample variances computed with the x̃1 and x̃2
measurements.


Figure 1.23 A scatter plot for 
positively correlated 
measurements and a rotated 
coordinate system. 


The relation between the original coordinates (x1, x2) and the rotated coordinates
(x̃1, x̃2) is provided by

    \tilde{x}_1 = x_1 \cos(\theta) + x_2 \sin(\theta)
    \tilde{x}_2 = -x_1 \sin(\theta) + x_2 \cos(\theta)    (1-18)

Given the relations in (1-18), we can formally substitute for x̃1 and x̃2 in (1-17)
and express the distance in terms of the original coordinates.

After some straightforward algebraic manipulations, the distance from
P = (x̃1, x̃2) to the origin O = (0, 0) can be written in terms of the original coordinates
x1 and x2 of P as

    d(O, P) = \sqrt{a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2}    (1-19)

where the a’s are numbers such that the distance is nonnegative for all possible values
of x1 and x2. Here a11, a12, and a22 are determined by the angle θ, and s11, s12,
and s22 calculated from the original data.² The particular forms for a11, a12, and a22
are not important at this point. What is important is the appearance of the cross-product
term 2a12x1x2 necessitated by the nonzero correlation r12.

Equation (1-19) can be compared with (1-13). The expression in (1-13) can be
regarded as a special case of (1-19) with a11 = 1/s11, a22 = 1/s22, and a12 = 0.
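The equivalence of (1-17) and (1-19) can be checked numerically. The Python sketch below is an illustration under stated assumptions: the function names and the numerical values are hypothetical, the coefficient formulas follow footnote 2, and the variances of the rotated coordinates are obtained from s11, s12, s22, and θ as the denominators appearing in that footnote.

```python
import math


def rotated_variances(theta, s11, s12, s22):
    """Sample variances of the rotated coordinates, from s11, s12, s22, and theta."""
    c, s = math.cos(theta), math.sin(theta)
    st11 = c * c * s11 + 2 * s * c * s12 + s * s * s22
    st22 = c * c * s22 - 2 * s * c * s12 + s * s * s11
    return st11, st22


def coefficients(theta, s11, s12, s22):
    """Coefficients a11, a12, a22 of (1-19), following footnote 2."""
    c, s = math.cos(theta), math.sin(theta)
    st11, st22 = rotated_variances(theta, s11, s12, s22)
    a11 = c * c / st11 + s * s / st22
    a22 = s * s / st11 + c * c / st22
    a12 = c * s / st11 - s * c / st22
    return a11, a12, a22


def distance_rotated(x1, x2, theta, s11, s12, s22):
    """Distance (1-17), computed after rotating the point with (1-18)."""
    c, s = math.cos(theta), math.sin(theta)
    xt1 = x1 * c + x2 * s
    xt2 = -x1 * s + x2 * c
    st11, st22 = rotated_variances(theta, s11, s12, s22)
    return math.sqrt(xt1 ** 2 / st11 + xt2 ** 2 / st22)


def distance_original(x1, x2, theta, s11, s12, s22):
    """Distance (1-19), computed directly from the original coordinates."""
    a11, a12, a22 = coefficients(theta, s11, s12, s22)
    return math.sqrt(a11 * x1 ** 2 + 2 * a12 * x1 * x2 + a22 * x2 ** 2)


# Hypothetical values, chosen only for illustration.
theta, s11, s12, s22 = math.radians(30), 4.0, 1.5, 3.0
d17 = distance_rotated(2.0, -1.0, theta, s11, s12, s22)
d19 = distance_original(2.0, -1.0, theta, s11, s12, s22)
print(math.isclose(d17, d19))  # True
```

With θ = 0 the same functions return a11 = 1/s11, a22 = 1/s22, and a12 = 0, which is the special case (1-13).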

In general, the statistical distance of the point P = (x1, x2) from the fixed point
Q = (y1, y2) for situations in which the variables are correlated has the general
form

    d(P, Q) = \sqrt{a_{11}(x_1 - y_1)^2 + 2a_{12}(x_1 - y_1)(x_2 - y_2) + a_{22}(x_2 - y_2)^2}    (1-20)

and can always be computed once a11, a12, and a22 are known. In addition, the coordinates
of all points P = (x1, x2) that are a constant squared distance c² from Q
satisfy

    a_{11}(x_1 - y_1)^2 + 2a_{12}(x_1 - y_1)(x_2 - y_2) + a_{22}(x_2 - y_2)^2 = c^2    (1-21)

By definition, this is the equation of an ellipse centered at Q. The graph of such an
equation is displayed in Figure 1.24. The major (long) and minor (short) axes are indicated.
They are parallel to the x̃1 and x̃2 axes. For the choice of a11, a12, and a22 in
footnote 2, the x̃1 and x̃2 axes are at an angle θ with respect to the x1 and x2 axes.

The generalization of the distance formulas of (1-19) and (1-20) to p dimensions
is straightforward. Let P = (x1, x2, ..., xp) be a point whose coordinates
represent variables that are correlated and subject to inherent variability. Let


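²Specifically,

    a_{11} = \frac{\cos^2(\theta)}{\cos^2(\theta)s_{11} + 2\sin(\theta)\cos(\theta)s_{12} + \sin^2(\theta)s_{22}} + \frac{\sin^2(\theta)}{\cos^2(\theta)s_{22} - 2\sin(\theta)\cos(\theta)s_{12} + \sin^2(\theta)s_{11}}

    a_{22} = \frac{\sin^2(\theta)}{\cos^2(\theta)s_{11} + 2\sin(\theta)\cos(\theta)s_{12} + \sin^2(\theta)s_{22}} + \frac{\cos^2(\theta)}{\cos^2(\theta)s_{22} - 2\sin(\theta)\cos(\theta)s_{12} + \sin^2(\theta)s_{11}}

and

    a_{12} = \frac{\cos(\theta)\sin(\theta)}{\cos^2(\theta)s_{11} + 2\sin(\theta)\cos(\theta)s_{12} + \sin^2(\theta)s_{22}} - \frac{\sin(\theta)\cos(\theta)}{\cos^2(\theta)s_{22} - 2\sin(\theta)\cos(\theta)s_{12} + \sin^2(\theta)s_{11}}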


Figure 1.24 Ellipse of points 
a constant distance from the 
point Q. 


O = (0, 0, ..., 0) denote the origin, and let Q = (y1, y2, ..., yp) be a specified
fixed point. Then the distances from P to O and from P to Q have the general
forms

    d(O, P) = \sqrt{a_{11}x_1^2 + a_{22}x_2^2 + \cdots + a_{pp}x_p^2 + 2a_{12}x_1x_2 + 2a_{13}x_1x_3 + \cdots + 2a_{p-1,p}x_{p-1}x_p}    (1-22)

and

    d(P, Q) = \sqrt{[a_{11}(x_1 - y_1)^2 + a_{22}(x_2 - y_2)^2 + \cdots + a_{pp}(x_p - y_p)^2 + 2a_{12}(x_1 - y_1)(x_2 - y_2)
              + 2a_{13}(x_1 - y_1)(x_3 - y_3) + \cdots + 2a_{p-1,p}(x_{p-1} - y_{p-1})(x_p - y_p)]}    (1-23)

where the a’s are numbers such that the distances are always nonnegative.³

We note that the distances in (1-22) and (1-23) are completely determined by
the coefficients (weights) a_ik, i = 1, 2, ..., p, k = 1, 2, ..., p. These coefficients can
be set out in the rectangular array

    \begin{bmatrix}
    a_{11} & a_{12} & \cdots & a_{1p} \\
    a_{12} & a_{22} & \cdots & a_{2p} \\
    \vdots & \vdots & \ddots & \vdots \\
    a_{1p} & a_{2p} & \cdots & a_{pp}
    \end{bmatrix}    (1-24)

where the a_ik’s with i ≠ k are displayed twice, since they are multiplied by 2 in the
distance formulas. Consequently, the entries in this array specify the distance functions.
The a_ik’s cannot be arbitrary numbers; they must be such that the computed
distance is nonnegative for every pair of points. (See Exercise 1.10.)
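The role of the array (1-24) can be sketched in Python. The function name and the coefficient values below are hypothetical. Summing a_ik (x_i - y_i)(x_k - y_k) over every cell of the symmetric array counts each off-diagonal coefficient twice, which reproduces the factor 2 in (1-23).

```python
import math


def quadratic_form_distance(a, p, q):
    """Distance (1-23) computed from the symmetric coefficient array `a` of (1-24)."""
    diff = [x - y for x, y in zip(p, q)]
    total = 0.0
    for i, row in enumerate(a):
        for k, a_ik in enumerate(row):
            # Off-diagonal cells appear twice in the array, giving the factor 2.
            total += a_ik * diff[i] * diff[k]
    if total < 0:
        raise ValueError("these coefficients do not define a valid distance")
    return math.sqrt(total)


# Hypothetical coefficients: a11 = 0.5, a12 = 0.2, a22 = 0.4.
a = [[0.5, 0.2],
     [0.2, 0.4]]
print(quadratic_form_distance(a, (1.0, 2.0), (0.0, 0.0)))

# An arbitrary array can fail: here the "squared distance" would be negative.
bad = [[1.0, 2.0],
       [2.0, 1.0]]
try:
    quadratic_form_distance(bad, (1.0, -1.0), (0.0, 0.0))
except ValueError as err:
    print(err)
```

The second array shows why the coefficients cannot be chosen freely: for the point (1, -1) the sum is 1 - 4 + 1 = -2, so no real distance exists.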

Contours of constant distances computed from (1-22) and (1-23) are 
hyperellipsoids. A hyperellipsoid resembles a football when p = 3; it is impossible
to visualize in more than three dimensions. 


³The algebraic expressions for the squares of the distances in (1-22) and (1-23) are known as quadratic
forms and, in particular, positive definite quadratic forms. It is possible to display these quadratic
forms in a simpler manner using matrix algebra; we shall do so in Section 2.3 of Chapter 2.


Figure 1.25 A cluster of points 
relative to a point P and the origin. 


The need to consider statistical rather than Euclidean distance is illustrated 
heuristically in Figure 1.25. Figure 1.25 depicts a cluster of points whose center of 
gravity (sample mean) is indicated by the point Q. Consider the Euclidean distances 
from the point Q to the point P and the origin O. The Euclidean distance from Q to 
P is larger than the Euclidean distance from Q to O. However, P appears to be more 
like the points in the cluster than does the origin. If we take into account the vari- 
ability of the points in the cluster and measure distance by the statistical distance in 
(1-20), then Q will be closer to P than to O. This result seems reasonable, given the 
nature of the scatter. 

Other measures of distance can be advanced. (See Exercise 1.12.) At times, it is
useful to consider distances that are not related to circles or ellipses. Any distance
measure d(P, Q) between two points P and Q is valid provided that it satisfies the
following properties, where R is any other intermediate point:

    d(P, Q) = d(Q, P)
    d(P, Q) > 0 if P ≠ Q
    d(P, Q) = 0 if P = Q
    d(P, Q) ≤ d(P, R) + d(R, Q)    (triangle inequality)    (1-25)
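The properties in (1-25) can be spot-checked on a finite set of points, as in the Python sketch below. The function names and the test points are hypothetical, and passing such a check is only a necessary condition: it can expose a violation but cannot prove that a proposed measure is valid for all points.

```python
import itertools
import math


def satisfies_distance_properties(d, points, tol=1e-12):
    """Check the four properties of (1-25) for every combination of the given points."""
    for p, q in itertools.product(points, repeat=2):
        if abs(d(p, q) - d(q, p)) > tol:          # symmetry
            return False
        if p == q and abs(d(p, q)) > tol:         # zero distance to itself
            return False
        if p != q and d(p, q) <= 0:               # positive otherwise
            return False
    for p, q, r in itertools.product(points, repeat=3):
        if d(p, q) > d(p, r) + d(r, q) + tol:     # triangle inequality
            return False
    return True


def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def squared_euclidean(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
print(satisfies_distance_properties(euclidean, points))          # True
print(satisfies_distance_properties(squared_euclidean, points))  # False
```

The squared Euclidean distance fails because of the triangle inequality: from (0, 0) to (2, 0) it equals 4, while the route through (1, 0) totals 1 + 1 = 2.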


1.6 Final Comments

We have attempted to motivate the study of multivariate analysis and to provide
you with some rudimentary, but important, methods for organizing, summarizing,
and displaying data. In addition, a general concept of distance has been introduced
that will be used repeatedly in later chapters.


Exercises

1.1. Consider the seven pairs of measurements (x1, x2) plotted in Figure 1.1:

x1    3    4    2    6    8    2    5
x2    5    5.5  4    7    10   5    7.5

Calculate the sample means x̄1 and x̄2, the sample variances s11 and s22, and the sample
covariance s12.

1.2. A morning newspaper lists the following used-car prices for a foreign compact with age
x1 measured in years and selling price x2 measured in thousands of dollars:

x1    1      2      3      3      4      5      6     8     9     11
x2    18.95  19.00  17.95  15.54  14.00  12.95  8.94  7.49  6.00  3.99

(a) Construct a scatter plot of the data and marginal dot diagrams.
(b) Infer the sign of the sample covariance s12 from the scatter plot.
(c) Compute the sample means x̄1 and x̄2 and the sample variances s11 and s22. Compute
the sample covariance s12 and the sample correlation coefficient r12. Interpret
these quantities.
(d) Display the sample mean array x̄, the sample variance-covariance array S_n, and the
sample correlation array R using (1-8).

1.3. The following are five measurements on the variables x1, x2, and x3:

x1    9    2    6    5    8
x2    12   8    6    4    10
x3    3    4    0    2    1

Find the arrays x̄, S_n, and R.


1.4. The world’s 10 largest companies yield the following data:

The World’s 10 Largest Companies¹

                        x1 = sales    x2 = profits    x3 = assets
Company                 (billions)    (billions)      (billions)
Citigroup                   108.28         17.05        1,484.10
General Electric            152.36         16.59          750.33
American Intl Group          95.04         10.91          766.42
Bank of America              65.45         14.14        1,110.46
HSBC Group                   62.97          9.52        1,031.29
ExxonMobil                  263.99         25.33          195.26
Royal Dutch/Shell           265.19         18.54          193.83
BP                          285.06         15.73          191.11
ING Group                    92.01          8.10        1,175.16
Toyota Motor                165.68         11.13          211.15

¹From www.Forbes.com partially based on Forbes The Forbes Global 2000, April 18, 2005.

(a) Plot the scatter diagram and marginal dot diagrams for variables x1 and x2. Comment
on the appearance of the diagrams.
(b) Compute x̄1, x̄2, s11, s22, s12, and r12. Interpret r12.

1.5. Use the data in Exercise 1.4.

(a) Plot the scatter diagrams and dot diagrams for (x2, x3) and (x1, x3). Comment on
the patterns.
(b) Compute the x̄, S_n, and R arrays for (x1, x2, x3).


1.6. The data in Table 1.5 are 42 measurements on air-pollution variables recorded at 12:00
noon in the Los Angeles area on different days. (See also the air-pollution data on the
web at www.prenhall.com/statistics.)

(a) Plot the marginal dot diagrams for all the variables.
(b) Construct the x̄, S_n, and R arrays, and interpret the entries in R.


Table 1.5 Air-Pollution Data
Wind (x1)  Solar radiation (x2)  CO (x3)  NO (x4)  NO2 (x5)  O3 (x6)  HC (x7)
8 98 ae 2 12 8 2 
7 107 4 3 9 5 3 
7 103 4 3 5 6 3 
10 88 5 2 8 15 4 
6 91 4 2 8 10 3 
8 90 5 2 12 12 4 
9 84 7 4 12 15 5 
5 72 6 4 21 14 4 
7 82 5 1 11 11 3 
8 64 5 2 13 9 4 
6 71 5 4 10 3 3 
6 91 4 2 12 7 3
7 72 7 4 18 10 3 
10 70 4 2 11 vi 3 
10 72 4 1 8 10 3 
9 77 4 1 9 10 3 
8 76 4 1 7 7 3 
8 71 5 3 16 4 
9 67 4 2 13 2 3 
9 69 3 3 9 5 3 
10 62 5 3 14 4 4 
9 88 4 2 7 6 3 
8 80 4 2 13 11 4 
5 30 3 3 5 2 3 
6 83 5 1 10 23 4 
8 84 3 2 7 6 3 
6 78 4 2 11 11 3 
8 79 2 1 7 10 3 
6 62 4 3 9 8 3 
10 37 3 1 7 2 3 
8 71 4 1 10 7 3 
7 52 4 1 12 8 4 
5 48 6 5 8 4 3 
6 75 4 1 10 24 3 
10 35 4 1 6 9 2 
8 85 4 1 9 10 2 
5 86 3 1 6 12 2 
5 86 7 2 13 18 2 
7 79 7 4 9 25 3
7 79 5 2 8 6 2 
6 68 6 2 11 14 3 
8 40 4 3 6 5 2 
Source: Data courtesy of Professor G. C. Tiao. 




1.7. You are given the following n = 3 observations on p = 2 variables:

    Variable 1:  $x_{11} = 2$   $x_{21} = 3$   $x_{31} = 4$
    Variable 2:  $x_{12} = 1$   $x_{22} = 2$   $x_{32} = 4$

(a) Plot the pairs of observations in the two-dimensional "variable space." That is,
construct a two-dimensional scatter plot of the data.
(b) Plot the data as two points in the three-dimensional "item space."


1.8. Evaluate the distance of the point P = (-1, -1) to the point Q = (1, 0) using the
Euclidean distance formula in (1-12) with p = 2 and using the statistical distance in (1-20)
with $a_{11} = 1/3$, $a_{22} = 4/27$, and $a_{12} = 1/9$. Sketch the locus of points that are a
constant squared statistical distance 1 from the point Q.


1.9. Consider the following eight pairs of measurements on two variables $x_1$ and $x_2$:

    x1   -6   -3   -2    1    2    5    6    8
    x2   -2   -3    1   -1    2    1    5    3

(a) Plot the data as a scatter diagram, and compute $s_{11}$, $s_{22}$, and $s_{12}$.

(b) Using (1-18), calculate the corresponding measurements on variables $\tilde{x}_1$ and $\tilde{x}_2$,
assuming that the original coordinate axes are rotated through an angle of $\theta = 26°$
[given cos(26°) = .899 and sin(26°) = .438].

(c) Using the $\tilde{x}_1$ and $\tilde{x}_2$ measurements from (b), compute the sample variances $\tilde{s}_{11}$
and $\tilde{s}_{22}$.

(d) Consider the new pair of measurements $(x_1, x_2) = (4, -2)$. Transform these to
measurements on $\tilde{x}_1$ and $\tilde{x}_2$ using (1-18), and calculate the distance d(O, P) of the
new point $P = (\tilde{x}_1, \tilde{x}_2)$ from the origin O = (0, 0) using (1-17).

Note: You will need $\tilde{s}_{11}$ and $\tilde{s}_{22}$ from (c).

(e) Calculate the distance from P = (4, -2) to the origin O = (0, 0) using (1-19) and
the expressions for $a_{11}$, $a_{22}$, and $a_{12}$ in footnote 2.

Note: You will need $s_{11}$, $s_{22}$, and $s_{12}$ from (a).

Compare the distance calculated here with the distance calculated using the $\tilde{x}_1$ and $\tilde{x}_2$
values in (d). (Within rounding error, the numbers should be the same.)


1.10. Are the following distance functions valid for distance from the origin? Explain.
(a) $x_1^2 + 4x_2^2 + x_1 x_2 = (\text{distance})^2$
(b) $x_1^2 - 2x_2^2 = (\text{distance})^2$


1.11. Verify that distance defined by (1-20) with $a_{11} = 4$, $a_{22} = 1$, and $a_{12} = -1$ satisfies the
first three conditions in (1-25). (The triangle inequality is more difficult to verify.)


1.12. Define the distance from the point $P = (x_1, x_2)$ to the origin O = (0, 0) as

    $d(O, P) = \max(|x_1|, |x_2|)$

(a) Compute the distance from P = (-3, 4) to the origin.
(b) Plot the locus of points whose squared distance from the origin is 1.
(c) Generalize the foregoing distance expression to points in p dimensions.


1.13. A large city has major roads laid out in a grid pattern, as indicated in the following
diagram. Streets 1 through 5 run north-south (NS), and streets A through E run east-west
(EW). Suppose there are retail stores located at intersections (A, 2), (E, 3), and (C, 5).


Assume the distance along a street between two intersections in either the NS or EW
direction is 1 unit. Define the distance between any two intersections (points) on the grid
to be the "city block" distance. [For example, the distance between intersections (D, 1)
and (C, 2), which we might call d((D, 1), (C, 2)), is given by d((D, 1), (C, 2))
= d((D, 1), (D, 2)) + d((D, 2), (C, 2)) = 1 + 1 = 2. Also, d((D, 1), (C, 2)) =
d((D, 1), (C, 1)) + d((C, 1), (C, 2)) = 1 + 1 = 2.]

[Diagram: street grid with NS streets 1 through 5 and EW streets A through E.]

Locate a supply facility (warehouse) at an intersection such that the sum of the
distances from the warehouse to the three retail stores is minimized.


The following exercises contain fairly extensive data sets. A computer may be necessary for
the required calculations.


1.14. Table 1.6 contains some of the raw data discussed in Section 1.2. (See also the
multiple-sclerosis data on the web at www.prenhall.com/statistics.) Two different visual stimuli
(S1 and S2) produced responses in both the left eye (L) and the right eye (R) of subjects
in the study groups. The values recorded in the table include $x_1$ (subject's age); $x_2$
(total response of both eyes to stimulus S1, that is, S1L + S1R); $x_3$ (difference between
responses of eyes to stimulus S1, |S1L - S1R|); and so forth.

(a) Plot the two-dimensional scatter diagram for the variables $x_2$ and $x_4$ for the
multiple-sclerosis group. Comment on the appearance of the diagram.

(b) Compute the $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$ arrays for the non-multiple-sclerosis and
multiple-sclerosis groups separately.


1.15. Some of the 98 measurements described in Section 1.2 are listed in Table 1.7. (See also
the radiotherapy data on the web at www.prenhall.com/statistics.) The data consist of
average ratings over the course of treatment for patients undergoing radiotherapy.
Variables measured include $x_1$ (number of symptoms, such as sore throat or nausea); $x_2$
(amount of activity, on a 1-5 scale); $x_3$ (amount of sleep, on a 1-5 scale); $x_4$ (amount of
food consumed, on a 1-3 scale); $x_5$ (appetite, on a 1-5 scale); and $x_6$ (skin reaction, on a
0-3 scale).

(a) Construct the two-dimensional scatter plot for variables $x_2$ and $x_3$ and the marginal
dot diagrams (or histograms). Do there appear to be any errors in the $x_3$ data?

(b) Compute the $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$ arrays. Interpret the pairwise correlations.


1.16. At the start of a study to determine whether exercise or dietary supplements would slow
bone loss in older women, an investigator measured the mineral content of bones by
photon absorptiometry. Measurements were recorded for three bones on the dominant
and nondominant sides and are shown in Table 1.8. (See also the mineral-content data
on the web at www.prenhall.com/statistics.)

Compute the $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$ arrays. Interpret the pairwise correlations.




Table 1.6 Multiple-Sclerosis Data

Non-Multiple-Sclerosis Group Data

Subject    x1       x2             x3             x4             x5
number     (Age)    (S1L + S1R)    |S1L - S1R|    (S2L + S2R)    |S2L - S2R|
 1         18       152.0           1.6           198.4            .0
 2         19       138.0            .4           180.8           1.6
 3         20       144.0            .0           186.4            .8
 4         20       143.6           3.2           194.8            .0
 5         20       148.8            .0           217.6            .0
 ⋮
65         67       154.4           2.4           205.2           6.0
66         69       171.2           1.6           210.4            .8
67         73       157.2            .4           204.8            .0
68         74       175.2           5.6           235.6            .4
69         79       155.0           1.4           204.4            .0

Multiple-Sclerosis Group Data

Subject
number     x1       x2             x3             x4             x5
 1         23       148.0            .8           205.4            .6
 2         25       195.2           3.2           262.8            .4
 3         25       158.0           8.0           209.8          12.2
 4         28       134.4            .0           198.4           3.2
 5         29       190.2          14.2           243.8          10.6
 ⋮
25         57       165.6          16.8           229.2          15.6
26         58       238.4           8.0           304.4           6.0
27         58       164.0            .8           216.8            .8
28         58       169.8            .0           219.2           1.6
29         59       199.8           4.6           250.2           1.0

Source: Data courtesy of Dr. G. G. Celesia.

Table 1.7 Radiotherapy Data

x1          x2          x3       x4       x5          x6
Symptoms    Activity    Sleep    Eat      Appetite    Skin reaction
 .889       1.389       1.555    2.222    1.945       1.000
2.813       1.437        .999    2.312    2.312       2.000
1.454       1.091       2.364    2.455    2.909       3.000
 .294        .941       1.059    2.000    1.000       1.000
2.727       2.545       2.819    2.727    4.091        .000
4.100       1.900       2.800    2.000    2.600       2.000
 .125       1.062       1.437    1.875    1.563        .000
6.231       2.769       1.462    2.385    4.000       2.000
3.000       1.455       2.090    2.273    3.272       2.000
 .889       1.000       1.000    2.000    1.000       2.000

Source: Data courtesy of Mrs. Annette Tealey, R.N. Values of $x_2$ and $x_3$ less than 1.0 are due to errors
in the data-collection process. Rows containing values of $x_2$ and $x_3$ less than 1.0 may be omitted.




Table 1.8 Mineral Content in Bones

Subject    Dominant              Dominant               Dominant
number     radius      Radius    humerus     Humerus    ulna        Ulna
 1         1.103       1.052     2.139
 2          .842        .859     1.873
 3          .925        .873     1.887
 4          .857        .744     1.739
 5          .795        .809     1.734
 6          .787        .779     1.509
 7          .933        .880     1.695
 8          .799        .851     1.740
 9          .945        .876     1.811
10          .921        .906     1.954
11          .792        .825     1.624
12          .815        .751     2.204
13          .755        .724     1.508
14          .880        .866     1.786
15          .900        .838     1.902
16          .764        .757     1.743
17          .733        .748     1.863
18          .932        .898     2.028
19          .856        .786     1.390
20          .890        .950     2.187
21          .688        .532     1.650
22          .940        .850     2.334
23          .493        .616     1.037
24          .835        .752     1.509
25          .915        .936     1.971

Source: Data courtesy of Everett Smith.




1.17. Some of the data described in Section 1.2 are listed in Table 1.9. (See also the
national-track-records data on the web at www.prenhall.com/statistics.) The national track
records for women in 54 countries can be examined for the relationships among the
running events. Compute the $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$ arrays. Notice the magnitudes of the correlation
coefficients as you go from the shorter (100-meter) to the longer (marathon) running
distances. Interpret these pairwise correlations.


1.18. Convert the national track records for women in Table 1.9 to speeds measured in meters
per second. For example, the record speed for the 100-m dash for Argentinian women is
100 m/11.57 sec = 8.643 m/sec. Notice that the records for the 800-m, 1500-m, 3000-m,
and marathon runs are measured in minutes. The marathon is 26.2 miles, or 42,195
meters, long. Compute the $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$ arrays. Notice the magnitudes of the correlation
coefficients as you go from the shorter (100 m) to the longer (marathon) running distances.
Interpret these pairwise correlations. Compare your results with the results you obtained
in Exercise 1.17.


1.19. Create the scatter plot and boxplot displays of Figure 1.5 for (a) the mineral-content
data in Table 1.8 and (b) the national-track-records data in Table 1.9.




Table 1.9 National Track Records for Women

                      100m    200m    400m    800m    1500m   3000m   Marathon
Country               (s)     (s)     (s)     (min)   (min)   (min)   (min)
Argentina             11.57   22.94   52.50   2.05    4.25     9.19   150.32
Australia             11.12   22.23   48.63   1.98    4.02     8.63   143.51
Austria               11.15   22.70   50.62   1.94    4.05     8.78   154.35
Belgium               11.14   22.48   51.45   1.97    4.08     8.82   143.05
Bermuda               11.46   23.05   53.30   2.07    4.29     9.81   174.18
Brazil                11.17   22.60   50.62   1.97    4.17     9.04   147.41
Canada                10.98   22.62   49.91   1.97    4.00     8.54   148.36
Chile                 11.65   23.84   53.68   2.00    4.22     9.26   152.23
China                 10.79   22.01   49.81   1.93    3.84     8.10   139.39
Columbia              11.31   22.92   49.64   2.04    4.34     9.37   155.19
Cook Islands          12.52   25.91   61.65   2.28    4.82    11.10   212.33
Costa Rica            11.72   23.92   52.57   2.10    4.52     9.84   164.33
Czech Republic        11.09   21.97   47.99   1.89    4.03     8.87   145.19
Denmark               11.42   23.36   52.92   2.02    4.12     8.71   149.34
Dominican Republic    11.63   23.91   53.02   2.09    4.54     9.89   166.46
Finland               11.13   22.39   50.14   2.01    4.10     8.69   148.00
France                10.73   21.99   48.25   1.94    4.03     8.64   148.27
Germany               10.81   21.71   47.60   1.92    3.96     8.51   141.45
Great Britain         11.10   22.10   49.43   1.94    3.97     8.37   135.25
Greece                10.83   22.67   50.56   2.00    4.09     8.96   153.40
Guatemala             11.92   24.50   55.64   2.15    4.48     9.71   171.33
Hungary               11.41   23.06   51.50   1.99    4.02     8.55   148.50
India                 11.56   23.86   55.08   2.10    4.36     9.50   154.29
Indonesia             11.38   22.82   51.05   2.00    4.10     9.11   158.10
Ireland               11.43   23.02   51.07   2.01    3.98     8.36   142.23
Israel                11.45   23.15   52.06   2.07    4.24     9.33   156.36
Italy                 11.14   22.60   51.31   1.96    3.98     8.59   143.47
Japan                 11.36   23.33   51.93   2.01    4.16     8.74   139.41
Kenya                 11.62   23.37   51.56   1.97    3.96     8.39   138.47
Korea, South          11.49   23.80   53.67   2.09    4.24     9.01   146.12
Korea, North          11.80   25.10   56.23   1.97    4.25     8.96   145.31
Luxembourg            11.76   23.96   56.07   2.07    4.35     9.21   149.23
Malaysia              11.50   23.37   52.56   2.12    4.39     9.31   169.28
Mauritius             11.72   23.83   54.62   2.06    4.33     9.24   167.09
Mexico                11.09   23.13   48.89   2.02    4.19     8.89   144.06
Myanmar (Burma)       11.66   23.69   52.96   2.03    4.20     9.08   158.42
Netherlands           11.08   22.81   51.35   1.93    4.06     8.57   143.43
New Zealand           11.32   23.13   51.60   1.97    4.10     8.76   146.46
Norway                11.41   23.31   52.45   2.03    4.01     8.53   141.06
Papua New Guinea      11.96   24.68   55.18   2.24    4.62    10.21   221.14
Philippines           11.28   23.35   54.75   2.12    4.41     9.81   165.48
Poland                10.93   22.13   49.28   1.95    3.99     8.53   144.18
Portugal              11.30   22.88   51.92   1.98    3.96     8.50   143.29
Romania               11.30   22.35   49.88   1.92    3.90     8.36   142.50
Russia                10.77   21.87   49.11   1.91    3.87     8.38   141.31
Samoa                 12.38   25.45   56.32   2.29    5.42    13.12   191.58
Singapore             12.13   24.54   55.08   2.12    4.52     9.94   154.41
Spain                 11.06   22.38   49.67   1.96    4.01     8.48   146.51
Sweden                11.16   22.82   51.69   1.99    4.09     8.81   150.39
Switzerland           11.34   22.88   51.32   1.98    3.97     8.60   145.51
Taiwan                11.22   22.56   52.74   2.08    4.38     9.63   159.53
Thailand              11.33   23.30   52.60   2.06    4.38    10.07   162.39
Turkey                11.25   22.71   53.15   2.01    3.92     8.53   151.43
U.S.A.                10.49   21.34   48.83   1.94    3.95     8.43   141.16

Source: IAAF/ATFS Track and Field Handbook for Helsinki 2005 (courtesy of Ottavio Castellini).


1.20. Refer to the bankruptcy data in Table 11.4, page 657, and on the following website:
www.prenhall.com/statistics. Using appropriate computer software,
(a) View the entire data set in $x_1, x_2, x_3$ space. Rotate the coordinate axes in various
directions. Check for unusual observations.


(b) Highlight the set of points corresponding to the bankrupt firms. Examine various 
three-dimensional perspectives. Are there some orientations of three-dimensional 
space for which the bankrupt firms can be distinguished from the nonbankrupt 
firms? Are there observations in each of the two groups that are likely to have a
significant impact on any rule developed to classify firms based on the sample means,
variances, and covariances calculated from these data? (See Exercise 11.24.)


1.21. Refer to the milk transportation-cost data in Table 6.10, page 345, and on the web at
www.prenhall.com/statistics. Using appropriate computer software,

(a) View the entire data set in three dimensions. Rotate the coordinate axes in various 
directions. Check for unusual observations. 

(b) Highlight the set of points corresponding to gasoline trucks. Do any of the gasoline- 
truck points appear to be multivariate outliers? (See Exercise 6.17.) Are there some 
orientations of $x_1, x_2, x_3$ space for which the set of points representing gasoline
trucks can be readily distinguished from the set of points representing diesel trucks? 


1.22. Refer to the oxygen-consumption data in Table 6.12, page 348, and on the web at 
www.prenhall.com/statistics. Using appropriate computer software, 
(a) View the entire data set in three dimensions employing various combinations of 
three variables to represent the coordinate axes. Begin with the $x_1, x_2, x_3$ space.
(b) Check this data set for outliers. 


1.23. Using the data in Table 11.9, page 666, and on the web at www.prenhall.com/statistics,
represent the cereals in each of the following ways.
(a) Stars. 
(b) Chernoff faces. (Experiment with the assignment of variables to facial characteristics.) 
1.24. Using the utility data in Table 12.4, page 688, and on the web at
www.prenhall.com/statistics, represent the public utility companies as Chernoff faces with
assignments of variables to facial characteristics different from those considered in
Example 1.12. Compare your faces with the faces in Figure 1.17. Are different groupings
indicated?




1.25. Using the data in Table 12.4 and on the web at www.prenhall.com/statistics, represent the 
22 public utility companies as stars. Visually group the companies into four or five 
clusters. 


1.26. The data in Table 1.10 (see the bull data on the web at www.prenhall.com/statistics) are 
the measured characteristics of 76 young (less than two years old) bulls sold at auction. 
Also included in the table are the selling prices (SalePr) of these bulls. The column head- 
ings (variables) are defined as follows: 


Breed    = 1 Angus, 5 Hereford, 8 Simental
YrHgt    = Yearling height at shoulder (inches)
FtFrBody = Fat free body (pounds)
PrctFFB  = Percent fat-free body
Frame    = Scale from 1 (small) to 8 (large)
BkFat    = Back fat (inches)
SaleHt   = Sale height at shoulder (inches)
SaleWt   = Sale weight (pounds)


(a) Compute the x, S,, and R arrays. Interpret the pairwise correlations. Do some of 
these variables appear to distinguish one breed from another? 

(b) View the data in three dimensions using the variables Breed, Frame, and BkFat. Ro- 
tate the coordinate axes in various directions. Check for outliers. Are the breeds well 
separated in this coordinate system? 

(c) Repeat part b using Breed, FtFrBody, and SaleHt. Which three-dimensional display 
appears to result in the best separation of the three breeds of bulls? 


Table 1.10 Data on Bulls

Breed   SalePr   YrHgt   FtFrBody   PrctFFB   Frame   BkFat   SaleHt   SaleWt
1       2200     51.0    1128       70.9      7       .25     54.8     1720
1       2250     51.9    1108       72.1      7       .25     55.3     1575
1       1625     49.9    1011       71.6      6       .15     53.1     1410
1       4600     53.1     993       68.9      8       .35     56.4     1595
1       2150     51.2     996       68.6      7       .25     55.0     1488
⋮
8       1450     51.4     997       73.4      7       .10     55.2     1454
8       1200     49.8     991       70.8      6       .15     54.6     1475
8       1425     50.0     928       70.8      6       .10     53.9     1375
8       1250     50.1     990       71.0      6       .10     54.9     1564
8       1500     51.7     992       70.6      7       .15     55.1     1458

Source: Data courtesy of Mark Ellersieck.


1.27. Table 1.11 presents the 2005 attendance (millions) at the fifteen most visited national 
parks and their size (acres). 


(a) Create a scatter plot and calculate the correlation coefficient. 




(b) Identify the park that is unusual. Drop this point and recalculate the correlation 
coefficient. Comment on the effect of this one point on correlation. 


(c) Would the correlation in Part b change if you measure size in square miles instead of 
acres? Explain. 


Table 1.11 Attendance and Size of National Parks 


National Park Size (acres) Visitors (millions) 
Arcadia 47.4 2.05 
Bruce Canyon 35.8 1.02
Cuyahoga Valley 32.9 2.53 
Everglades 1508.5 1.23 
Grand Canyon 1217.4 4.40 
Grand Teton 310.0 2.46 
Great Smoky 521.8 9.19 
Hot Springs 5.6 1.34 
Olympic 922.7 3.14 
Mount Rainier 235.6 1.17 
Rocky Mountain 265.8 2.80 
Shenandoah 199.0 1.09
Yellowstone 2219.8 2.84 
Yosemite 761.3 3.30 
Zion 146.6 2.59 




Chapter 2


MATRIX ALGEBRA AND RANDOM VECTORS


2.1 Introduction 


We saw in Chapter 1 that multivariate data can be conveniently displayed as an
array of numbers. In general, a rectangular array of numbers with, for instance, n
rows and p columns is called a matrix of dimension $n \times p$. The study of multivariate
methods is greatly facilitated by the use of matrix algebra.

The matrix algebra results presented in this chapter will enable us to concisely
state statistical models. Moreover, the formal relations expressed in matrix terms
are easily programmed on computers to allow the routine calculation of important
statistical quantities.

We begin by introducing some very basic concepts that are essential to both our 
geometrical interpretations and algebraic explanations of subsequent statistical 
techniques. If you have not been previously exposed to the rudiments of matrix al- 
gebra, you may prefer to follow the brief refresher in the next section by the more 
detailed review provided in Supplement 2A. 


2.2 Some Basics of Matrix and Vector Algebra 


Vectors 
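An array $\mathbf{x}$ of n real numbers $x_1, x_2, \ldots, x_n$ is called a vector, and it is written as

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \quad \text{or} \quad \mathbf{x}' = [x_1, x_2, \ldots, x_n]$$

where the prime denotes the operation of transposing a column to a row.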




Figure 2.1 The vector x’ = [1,3,2]. 


A vector $\mathbf{x}$ can be represented geometrically as a directed line in n dimensions
with component $x_1$ along the first axis, $x_2$ along the second axis, ..., and $x_n$ along the
nth axis. This is illustrated in Figure 2.1 for n = 3.

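A vector can be expanded or contracted by multiplying it by a constant c. In
particular, we define the vector $c\mathbf{x}$ as

$$c\mathbf{x} = \begin{bmatrix} cx_1 \\ cx_2 \\ \vdots \\ cx_n \end{bmatrix}$$

That is, $c\mathbf{x}$ is the vector obtained by multiplying each element of $\mathbf{x}$ by c. [See
Figure 2.2(a).]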




Figure 2.2 Scalar multiplication and vector addition. 




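Two vectors may be added. Addition of $\mathbf{x}$ and $\mathbf{y}$ is defined as

$$\mathbf{x} + \mathbf{y} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{bmatrix}$$

so that $\mathbf{x} + \mathbf{y}$ is the vector with ith element $x_i + y_i$.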

The sum of two vectors emanating from the origin is the diagonal of the
parallelogram formed with the two original vectors as adjacent sides. This geometrical
interpretation is illustrated in Figure 2.2(b).
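
The two operations just defined can be sketched in a few lines of Python. This is only an illustration: the helper names `scale` and `add` are hypothetical, vectors are represented as plain lists, and the numbers are chosen for convenience.

```python
def scale(c, x):
    """Return the vector cx: each element of x multiplied by the constant c."""
    return [c * xi for xi in x]


def add(x, y):
    """Return x + y, the vector whose ith element is x_i + y_i."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    return [xi + yi for xi, yi in zip(x, y)]


x = [1, 3, 2]    # illustrative values
y = [-2, 1, -1]  # illustrative values

print(scale(3, x))  # [3, 9, 6]
print(add(x, y))    # [-1, 4, 1]
```

Representing a vector as a list keeps the element-by-element nature of both definitions visible; a numerical library would normally be used for larger problems.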

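A vector has both direction and length. In n = 2 dimensions, we consider the
vector

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

The length of $\mathbf{x}$, written $L_{\mathbf{x}}$, is defined to be

$$L_{\mathbf{x}} = \sqrt{x_1^2 + x_2^2}$$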


Geometrically, the length of a vector in two dimensions can be viewed as the
hypotenuse of a right triangle. This is demonstrated schematically in Figure 2.3.
The length of a vector $\mathbf{x}' = [x_1, x_2, \ldots, x_n]$, with n components, is defined by

$$L_{\mathbf{x}} = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} \tag{2-1}$$

Multiplication of a vector $\mathbf{x}$ by a scalar c changes the length. From Equation (2-1),

$$L_{c\mathbf{x}} = \sqrt{c^2 x_1^2 + c^2 x_2^2 + \cdots + c^2 x_n^2} = |c| \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = |c| L_{\mathbf{x}}$$

Multiplication by c does not change the direction of the vector $\mathbf{x}$ if c > 0.
However, a negative value of c creates a vector with a direction opposite that of $\mathbf{x}$.
From

$$L_{c\mathbf{x}} = |c| L_{\mathbf{x}} \tag{2-2}$$

it is clear that $\mathbf{x}$ is expanded if |c| > 1 and contracted if 0 < |c| < 1. [Recall
Figure 2.2(a).] Choosing $c = L_{\mathbf{x}}^{-1}$, we obtain the unit vector $L_{\mathbf{x}}^{-1}\mathbf{x}$, which has length 1
and lies in the direction of $\mathbf{x}$.
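
Relations (2-1) and (2-2) and the unit-vector construction can be checked numerically. The sketch below uses hypothetical helper names (`length`, `unit_vector`) and arbitrary illustrative values for the vector and the constant c.

```python
import math


def length(x):
    """Length of x as in (2-1): square root of the sum of squared elements."""
    return math.sqrt(sum(xi * xi for xi in x))


def unit_vector(x):
    """Return x divided by its length, a vector of length 1 in the direction of x."""
    lx = length(x)
    if lx == 0:
        raise ValueError("the zero vector has no direction")
    return [xi / lx for xi in x]


x = [1, 3, 2]  # illustrative vector
c = -2.5       # illustrative constant; negative, so cx points opposite to x
cx = [c * xi for xi in x]

# (2-2): the length of cx equals |c| times the length of x.
assert math.isclose(length(cx), abs(c) * length(x))

# Choosing c = 1 / L_x yields a vector of length 1.
assert math.isclose(length(unit_vector(x)), 1.0)
```

The comparisons use `math.isclose` because floating-point square roots are only approximate.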


Figure 2.3 Length of $\mathbf{x} = \sqrt{x_1^2 + x_2^2}$.


Figure 2.4 The angle $\theta$ between $\mathbf{x}' = [x_1, x_2]$ and $\mathbf{y}' = [y_1, y_2]$.


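A second geometrical concept is angle. Consider two vectors in a plane and the
angle $\theta$ between them, as in Figure 2.4. From the figure, $\theta$ can be represented as
the difference between the angles $\theta_1$ and $\theta_2$ formed by the two vectors and the first
coordinate axis. Since, by definition,

$$\cos(\theta_1) = \frac{x_1}{L_{\mathbf{x}}} \qquad \cos(\theta_2) = \frac{y_1}{L_{\mathbf{y}}}$$

$$\sin(\theta_1) = \frac{x_2}{L_{\mathbf{x}}} \qquad \sin(\theta_2) = \frac{y_2}{L_{\mathbf{y}}}$$

and

$$\cos(\theta) = \cos(\theta_2 - \theta_1) = \cos(\theta_2)\cos(\theta_1) + \sin(\theta_2)\sin(\theta_1)$$

the angle $\theta$ between the two vectors $\mathbf{x}' = [x_1, x_2]$ and $\mathbf{y}' = [y_1, y_2]$ is specified by

$$\cos(\theta) = \cos(\theta_2 - \theta_1) = \left(\frac{y_1}{L_{\mathbf{y}}}\right)\left(\frac{x_1}{L_{\mathbf{x}}}\right) + \left(\frac{y_2}{L_{\mathbf{y}}}\right)\left(\frac{x_2}{L_{\mathbf{x}}}\right) = \frac{x_1 y_1 + x_2 y_2}{L_{\mathbf{x}} L_{\mathbf{y}}} \tag{2-3}$$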


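We find it convenient to introduce the inner product of two vectors. For n = 2
dimensions, the inner product of $\mathbf{x}$ and $\mathbf{y}$ is

$$\mathbf{x}'\mathbf{y} = x_1 y_1 + x_2 y_2$$

With this definition and Equation (2-3),

$$L_{\mathbf{x}} = \sqrt{\mathbf{x}'\mathbf{x}} \qquad \cos(\theta) = \frac{\mathbf{x}'\mathbf{y}}{L_{\mathbf{x}} L_{\mathbf{y}}} = \frac{\mathbf{x}'\mathbf{y}}{\sqrt{\mathbf{x}'\mathbf{x}}\,\sqrt{\mathbf{y}'\mathbf{y}}}$$

Since cos(90°) = cos(270°) = 0 and $\cos(\theta) = 0$ only if $\mathbf{x}'\mathbf{y} = 0$, $\mathbf{x}$ and $\mathbf{y}$ are
perpendicular when $\mathbf{x}'\mathbf{y} = 0$.

For an arbitrary number of dimensions n, we define the inner product of $\mathbf{x}$
and $\mathbf{y}$ as

$$\mathbf{x}'\mathbf{y} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n \tag{2-4}$$

The inner product is denoted by either $\mathbf{x}'\mathbf{y}$ or $\mathbf{y}'\mathbf{x}$.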




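Using the inner product, we have the natural extension of length and angle to
vectors of n components:

$$L_{\mathbf{x}} = \text{length of } \mathbf{x} = \sqrt{\mathbf{x}'\mathbf{x}} \tag{2-5}$$

$$\cos(\theta) = \frac{\mathbf{x}'\mathbf{y}}{L_{\mathbf{x}} L_{\mathbf{y}}} = \frac{\mathbf{x}'\mathbf{y}}{\sqrt{\mathbf{x}'\mathbf{x}}\,\sqrt{\mathbf{y}'\mathbf{y}}} \tag{2-6}$$

Since, again, $\cos(\theta) = 0$ only if $\mathbf{x}'\mathbf{y} = 0$, we say that $\mathbf{x}$ and $\mathbf{y}$ are perpendicular
when $\mathbf{x}'\mathbf{y} = 0$.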
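
As a rough sketch of (2-4) through (2-6), the inner product, the angle, and the perpendicularity condition might be coded as follows. The function names and the tolerance `tol` are assumptions introduced for illustration only.

```python
import math


def inner(x, y):
    """Inner product x'y as in (2-4)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    return sum(xi * yi for xi, yi in zip(x, y))


def angle_degrees(x, y):
    """Angle between x and y from (2-6), in degrees."""
    cos_theta = inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))
    # Guard against tiny floating-point excursions outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def is_perpendicular(x, y, tol=1e-12):
    """x and y are perpendicular when x'y = 0 (up to a small tolerance)."""
    return abs(inner(x, y)) <= tol


x = [1, 1]   # illustrative values
y = [1, -1]  # illustrative values
print(inner(x, y), is_perpendicular(x, y))
```

Because the inner product is symmetric, `inner(x, y)` and `inner(y, x)` return the same value, matching the remark that x'y and y'x denote the same quantity.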


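Example 2.1 (Calculating lengths of vectors and the angle between them) Given the
vectors $\mathbf{x}' = [1, 3, 2]$ and $\mathbf{y}' = [-2, 1, -1]$, find $3\mathbf{x}$ and $\mathbf{x} + \mathbf{y}$. Next, determine
the length of $\mathbf{x}$, the length of $\mathbf{y}$, and the angle between $\mathbf{x}$ and $\mathbf{y}$. Also, check that
the length of $3\mathbf{x}$ is three times the length of $\mathbf{x}$.

First,

$$3\mathbf{x} = 3\begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 3 \\ 9 \\ 6 \end{bmatrix}$$

$$\mathbf{x} + \mathbf{y} = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix} + \begin{bmatrix} -2 \\ 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 1 - 2 \\ 3 + 1 \\ 2 - 1 \end{bmatrix} = \begin{bmatrix} -1 \\ 4 \\ 1 \end{bmatrix}$$

Next, $\mathbf{x}'\mathbf{x} = 1^2 + 3^2 + 2^2 = 14$, $\mathbf{y}'\mathbf{y} = (-2)^2 + 1^2 + (-1)^2 = 6$, and $\mathbf{x}'\mathbf{y} =
1(-2) + 3(1) + 2(-1) = -1$. Therefore,

$$L_{\mathbf{x}} = \sqrt{\mathbf{x}'\mathbf{x}} = \sqrt{14} = 3.742 \qquad L_{\mathbf{y}} = \sqrt{\mathbf{y}'\mathbf{y}} = \sqrt{6} = 2.449$$

and

$$\cos(\theta) = \frac{\mathbf{x}'\mathbf{y}}{L_{\mathbf{x}} L_{\mathbf{y}}} = \frac{-1}{3.742 \times 2.449} = -.109$$

so $\theta = 96.3°$. Finally,

$$L_{3\mathbf{x}} = \sqrt{3^2 + 9^2 + 6^2} = \sqrt{126} \quad \text{and} \quad 3L_{\mathbf{x}} = 3\sqrt{14} = \sqrt{126}$$

showing $L_{3\mathbf{x}} = 3L_{\mathbf{x}}$. ■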
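
The arithmetic of Example 2.1 can be reproduced with a short script. This is only a numerical check of the worked example; the variable names are chosen here for illustration.

```python
import math

x = [1, 3, 2]
y = [-2, 1, -1]

xx = sum(a * a for a in x)             # x'x = 14
yy = sum(b * b for b in y)             # y'y = 6
xy = sum(a * b for a, b in zip(x, y))  # x'y = -1

L_x = math.sqrt(xx)
L_y = math.sqrt(yy)
cos_theta = xy / (L_x * L_y)
theta = math.degrees(math.acos(cos_theta))

print(round(L_x, 3), round(L_y, 3))         # 3.742 2.449
print(round(cos_theta, 3), round(theta, 1))  # -0.109 96.3

# The length of 3x is three times the length of x.
three_x = [3 * a for a in x]
L_3x = math.sqrt(sum(a * a for a in three_x))
assert math.isclose(L_3x, 3 * L_x)
```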


A pair of vectors $\mathbf{x}$ and $\mathbf{y}$ of the same dimension is said to be linearly dependent
if there exist constants $c_1$ and $c_2$, both not zero, such that

$$c_1 \mathbf{x} + c_2 \mathbf{y} = \mathbf{0}$$

A set of vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ is said to be linearly dependent if there exist constants
$c_1, c_2, \ldots, c_k$, not all zero, such that

$$c_1 \mathbf{x}_1 + c_2 \mathbf{x}_2 + \cdots + c_k \mathbf{x}_k = \mathbf{0} \tag{2-7}$$

Linear dependence implies that at least one vector in the set can be written as a
linear combination of the other vectors. Vectors of the same dimension that are not
linearly dependent are said to be linearly independent.




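Example 2.2 (Identifying linearly independent vectors) Consider the set of vectors

$$\mathbf{x}_1 = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} \qquad \mathbf{x}_2 = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} \qquad \mathbf{x}_3 = \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}$$

Setting

$$c_1 \mathbf{x}_1 + c_2 \mathbf{x}_2 + c_3 \mathbf{x}_3 = \mathbf{0}$$

implies that

$$\begin{aligned} c_1 + c_2 + c_3 &= 0 \\ 2c_1 - 2c_3 &= 0 \\ c_1 - c_2 + c_3 &= 0 \end{aligned}$$

with the unique solution $c_1 = c_2 = c_3 = 0$. As we cannot find three constants $c_1$, $c_2$,
and $c_3$, not all zero, such that $c_1 \mathbf{x}_1 + c_2 \mathbf{x}_2 + c_3 \mathbf{x}_3 = \mathbf{0}$, the vectors $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$ are
linearly independent. ■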
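
One way to check the conclusion of Example 2.2 numerically is to place the three vectors in the columns of a 3 × 3 array and compute its determinant. The determinant criterion (a nonzero determinant means the only solution is all constants equal to zero) is a standard linear-algebra fact and is an assumption brought in here; it is not the argument used in the example. The helper `det3` is hypothetical.

```python
def det3(m):
    """Determinant of a 3 x 3 array by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Columns are x1 = [1, 2, 1]', x2 = [1, 0, -1]', x3 = [1, -2, 1]' from Example 2.2.
independent = [[1, 1, 1],
               [2, 0, -2],
               [1, -1, 1]]
print(det3(independent))  # -8, nonzero, so the vectors are linearly independent

# Illustrative contrast: replace x3 by x1 + x2 = [2, 2, 0]', which is dependent.
dependent = [[1, 1, 2],
             [2, 0, 2],
             [1, -1, 0]]
print(det3(dependent))  # 0
```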


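The projection (or shadow) of a vector $\mathbf{x}$ on a vector $\mathbf{y}$ is

$$\text{Projection of } \mathbf{x} \text{ on } \mathbf{y} = \frac{(\mathbf{x}'\mathbf{y})}{\mathbf{y}'\mathbf{y}}\,\mathbf{y} = \frac{(\mathbf{x}'\mathbf{y})}{L_{\mathbf{y}}}\,\frac{1}{L_{\mathbf{y}}}\,\mathbf{y} \tag{2-8}$$

where the vector $L_{\mathbf{y}}^{-1}\mathbf{y}$ has unit length. The length of the projection is

$$\text{Length of projection} = \frac{|\mathbf{x}'\mathbf{y}|}{L_{\mathbf{y}}} = L_{\mathbf{x}} \left| \frac{\mathbf{x}'\mathbf{y}}{L_{\mathbf{x}} L_{\mathbf{y}}} \right| = L_{\mathbf{x}} |\cos(\theta)| \tag{2-9}$$

where $\theta$ is the angle between $\mathbf{x}$ and $\mathbf{y}$. (See Figure 2.5.)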
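
A small sketch of (2-8) and (2-9) follows. The helper names are hypothetical, and the vectors are simply reused from Example 2.1 for illustration.

```python
import math


def inner(x, y):
    """Inner product x'y."""
    return sum(xi * yi for xi, yi in zip(x, y))


def projection(x, y):
    """Projection of x on y as in (2-8): (x'y / y'y) y."""
    yy = inner(y, y)
    if yy == 0:
        raise ValueError("cannot project onto the zero vector")
    coef = inner(x, y) / yy
    return [coef * yi for yi in y]


x = [1, 3, 2]
y = [-2, 1, -1]

p = projection(x, y)
length_p = math.sqrt(inner(p, p))

# (2-9): the length of the projection equals |x'y| / L_y ...
assert math.isclose(length_p, abs(inner(x, y)) / math.sqrt(inner(y, y)))

# ... which is also L_x |cos(theta)|.
L_x = math.sqrt(inner(x, x))
cos_theta = inner(x, y) / (L_x * math.sqrt(inner(y, y)))
assert math.isclose(length_p, L_x * abs(cos_theta))
```

Here the coefficient x'y / y'y is negative, so the projection points in the direction opposite to y, while its length is still given by the absolute value in (2-9).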


Figure 2.5 The projection of $\mathbf{x}$ on $\mathbf{y}$.


Matrices 


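A matrix is any rectangular array of real numbers. We denote an arbitrary array of n
rows and p columns by

$$\underset{(n \times p)}{\mathbf{A}} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{np} \end{bmatrix}$$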




Many of the vector concepts just introduced have direct generalizations to matrices. 

The transpose operation A’ of a matrix changes the columns into rows, so that 
the first column of A becomes the first row of A’, the second column becomes the 
second row, and so forth. 


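Example 2.3 (The transpose of a matrix) If

$$\underset{(2 \times 3)}{\mathbf{A}} = \begin{bmatrix} 3 & -1 & 2 \\ 1 & 5 & 4 \end{bmatrix}$$

then

$$\underset{(3 \times 2)}{\mathbf{A}'} = \begin{bmatrix} 3 & 1 \\ -1 & 5 \\ 2 & 4 \end{bmatrix}$$ ■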
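
In code, the transpose amounts to turning columns into rows. The sketch below stores a matrix as a list of rows, which is an assumption of this illustration, and applies a hypothetical `transpose` helper to the 2 × 3 matrix of Example 2.3.

```python
def transpose(A):
    """Return A': the first column of A becomes the first row, and so forth."""
    return [list(column) for column in zip(*A)]


A = [[3, -1, 2],
     [1, 5, 4]]  # the (2 x 3) matrix of Example 2.3

print(transpose(A))  # [[3, 1], [-1, 5], [2, 4]]
```

Transposing twice returns the original matrix, which gives a quick sanity check.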


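A matrix may also be multiplied by a constant c. The product $c\mathbf{A}$ is the matrix
that results from multiplying each element of $\mathbf{A}$ by c. Thus

$$\underset{(n \times p)}{c\mathbf{A}} = \begin{bmatrix} ca_{11} & ca_{12} & \cdots & ca_{1p} \\ ca_{21} & ca_{22} & \cdots & ca_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ ca_{n1} & ca_{n2} & \cdots & ca_{np} \end{bmatrix}$$

Two matrices $\mathbf{A}$ and $\mathbf{B}$ of the same dimensions can be added. The sum $\mathbf{A} + \mathbf{B}$ has
(i, j)th entry $a_{ij} + b_{ij}$.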
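
Both operations act entry by entry, as the following sketch shows. The helper names are hypothetical, matrices are stored as lists of rows, and the numbers are those of the two 2 × 3 matrices used in Example 2.4.

```python
def scalar_multiple(c, A):
    """Return cA: every element of A multiplied by the constant c."""
    return [[c * a for a in row] for row in A]


def matrix_sum(A, B):
    """Return A + B, whose (i, j)th entry is a_ij + b_ij."""
    same_shape = len(A) == len(B) and all(len(ra) == len(rb) for ra, rb in zip(A, B))
    if not same_shape:
        raise ValueError("matrices must have the same dimensions")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


A = [[0, 3, 1],
     [1, -1, 1]]
B = [[1, -2, -3],
     [2, 5, 1]]

print(scalar_multiple(4, A))  # [[0, 12, 4], [4, -4, 4]]
print(matrix_sum(A, B))       # [[1, 1, -2], [3, 4, 2]]
```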


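Example 2.4 (The sum of two matrices and multiplication of a matrix by a constant)
If

$$\underset{(2 \times 3)}{\mathbf{A}} = \begin{bmatrix} 0 & 3 & 1 \\ 1 & -1 & 1 \end{bmatrix} \quad \text{and} \quad \underset{(2 \times 3)}{\mathbf{B}} = \begin{bmatrix} 1 & -2 & -3 \\ 2 & 5 & 1 \end{bmatrix}$$

then

$$\underset{(2 \times 3)}{4\mathbf{A}} = \begin{bmatrix} 0 & 12 & 4 \\ 4 & -4 & 4 \end{bmatrix} \quad \text{and}$$

$$\underset{(2 \times 3)}{\mathbf{A}} + \underset{(2 \times 3)}{\mathbf{B}} = \begin{bmatrix} 0 + 1 & 3 - 2 & 1 - 3 \\ 1 + 2 & -1 + 5 & 1 + 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 & -2 \\ 3 & 4 & 2 \end{bmatrix}$$ ■


It is also possible to define the multiplication of two matrices if the dimensions
of the matrices conform in the following manner: When $\mathbf{A}$ is $(n \times k)$ and $\mathbf{B}$ is
$(k \times p)$, so that the number of elements in a row of $\mathbf{A}$ is the same as the number of
elements in a column of $\mathbf{B}$, we can form the matrix product $\mathbf{AB}$. An element of the
new matrix $\mathbf{AB}$ is formed by taking the inner product of each row of $\mathbf{A}$ with each
column of $\mathbf{B}$.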




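The matrix product $\mathbf{AB}$ is

$$\underset{(n \times k)}{\mathbf{A}}\;\underset{(k \times p)}{\mathbf{B}} = \text{the } (n \times p) \text{ matrix whose entry in the } i\text{th row and } j\text{th column is the inner product of the } i\text{th row of } \mathbf{A} \text{ and the } j\text{th column of } \mathbf{B}$$

or

$$(i, j) \text{ entry of } \mathbf{AB} = a_{i1} b_{1j} + a_{i2} b_{2j} + \cdots + a_{ik} b_{kj} = \sum_{\ell=1}^{k} a_{i\ell} b_{\ell j} \tag{2-10}$$

When k = 4, we have four products to add for each entry in the matrix $\mathbf{AB}$. Thus,

$$\underset{(n \times 4)}{\mathbf{A}}\;\underset{(4 \times p)}{\mathbf{B}} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ \vdots & \vdots & \vdots & \vdots \\ a_{i1} & a_{i2} & a_{i3} & a_{i4} \\ \vdots & \vdots & \vdots & \vdots \\ a_{n1} & a_{n2} & a_{n3} & a_{n4} \end{bmatrix} \begin{bmatrix} b_{11} & \cdots & b_{1j} & \cdots & b_{1p} \\ b_{21} & \cdots & b_{2j} & \cdots & b_{2p} \\ b_{31} & \cdots & b_{3j} & \cdots & b_{3p} \\ b_{41} & \cdots & b_{4j} & \cdots & b_{4p} \end{bmatrix}$$

$$= \begin{bmatrix} & \vdots & \\ \cdots & (a_{i1} b_{1j} + a_{i2} b_{2j} + a_{i3} b_{3j} + a_{i4} b_{4j}) & \cdots \\ & \vdots & \end{bmatrix} \quad (\text{row } i, \text{ column } j)$$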


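Example 2.5 (Matrix multiplication) If

\[
\mathbf{A} = \begin{bmatrix} 3 & -1 & 2 \\ 1 & 5 & 4 \end{bmatrix}, \quad
\mathbf{B} = \begin{bmatrix} -2 \\ 7 \\ 9 \end{bmatrix}, \quad\text{and}\quad
\mathbf{C} = \begin{bmatrix} 2 & 0 \\ 1 & -1 \end{bmatrix}
\]

then

\[
\underset{(2\times 3)}{\mathbf{A}}\ \underset{(3\times 1)}{\mathbf{B}}
= \begin{bmatrix} 3 & -1 & 2 \\ 1 & 5 & 4 \end{bmatrix}
\begin{bmatrix} -2 \\ 7 \\ 9 \end{bmatrix}
= \begin{bmatrix} 3(-2) + (-1)(7) + 2(9) \\ 1(-2) + 5(7) + 4(9) \end{bmatrix}
= \underset{(2\times 1)}{\begin{bmatrix} 5 \\ 69 \end{bmatrix}}
\]

and

\[
\underset{(2\times 2)}{\mathbf{C}}\ \underset{(2\times 3)}{\mathbf{A}}
= \begin{bmatrix} 2 & 0 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} 3 & -1 & 2 \\ 1 & 5 & 4 \end{bmatrix}
= \begin{bmatrix} 2(3) + 0(1) & 2(-1) + 0(5) & 2(2) + 0(4) \\ 1(3) - 1(1) & 1(-1) - 1(5) & 1(2) - 1(4) \end{bmatrix}
= \underset{(2\times 3)}{\begin{bmatrix} 6 & -2 & 4 \\ 2 & -6 & -2 \end{bmatrix}}
\]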
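Definition (2-10) translates almost word for word into code. The sketch below is a minimal illustration, assuming matrices stored as nested Python lists; the function name `matmul` is our own. It checks that the dimensions conform and then forms each entry as the inner product of a row of the first matrix with a column of the second, using the matrices of Example 2.5.

```python
def matmul(a, b):
    """Product of an (n x k) matrix a and a (k x p) matrix b, as nested lists."""
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("number of columns of a must equal number of rows of b")
    p = len(b[0])
    # Entry (i, j) is the sum over l of a[i][l] * b[l][j], as in (2-10).
    return [[sum(row[l] * b[l][j] for l in range(k)) for j in range(p)]
            for row in a]


A = [[3, -1, 2], [1, 5, 4]]
B = [[-2], [7], [9]]
C = [[2, 0], [1, -1]]

AB = matmul(A, B)   # (2 x 3)(3 x 1) gives a (2 x 1) matrix
CA = matmul(C, A)   # (2 x 2)(2 x 3) gives a (2 x 3) matrix
```

Calling `matmul(A, C)` raises `ValueError`, because a (2 × 3) matrix cannot be postmultiplied by a (2 × 2) matrix.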




When a matrix B consists of a single column, it is customary to use the lowercase
b vector notation.


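Example 2.6 (Some typical products and their dimensions) Let

\[
\mathbf{A} = \begin{bmatrix} 1 & -2 & 3 \\ 2 & 4 & -1 \end{bmatrix}, \quad
\mathbf{b} = \begin{bmatrix} 7 \\ -3 \\ 6 \end{bmatrix}, \quad
\mathbf{c} = \begin{bmatrix} 5 \\ 8 \\ -4 \end{bmatrix}, \quad
\mathbf{d} = \begin{bmatrix} 2 \\ 9 \end{bmatrix}
\]

Then Ab, bc′, b′c, and d′Ab are typical products.

\[
\mathbf{A}\mathbf{b} = \begin{bmatrix} 1 & -2 & 3 \\ 2 & 4 & -1 \end{bmatrix}
\begin{bmatrix} 7 \\ -3 \\ 6 \end{bmatrix}
= \begin{bmatrix} 31 \\ -4 \end{bmatrix}
\]

The product Ab is a vector with dimension equal to the number of rows of A.

\[
\mathbf{b}'\mathbf{c} = \begin{bmatrix} 7 & -3 & 6 \end{bmatrix}
\begin{bmatrix} 5 \\ 8 \\ -4 \end{bmatrix} = [-13]
\]

The product b′c is a 1 × 1 vector or a single number, here −13.

\[
\mathbf{b}\mathbf{c}' = \begin{bmatrix} 7 \\ -3 \\ 6 \end{bmatrix}
\begin{bmatrix} 5 & 8 & -4 \end{bmatrix}
= \begin{bmatrix} 35 & 56 & -28 \\ -15 & -24 & 12 \\ 30 & 48 & -24 \end{bmatrix}
\]

The product bc′ is a matrix whose row dimension equals the dimension of b and
whose column dimension equals that of c. This product is unlike b′c, which is a
single number.

\[
\mathbf{d}'\mathbf{A}\mathbf{b} = \begin{bmatrix} 2 & 9 \end{bmatrix}
\begin{bmatrix} 1 & -2 & 3 \\ 2 & 4 & -1 \end{bmatrix}
\begin{bmatrix} 7 \\ -3 \\ 6 \end{bmatrix} = [26]
\]

The product d′Ab is a 1 × 1 vector or a single number, here 26. ■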


Square matrices will be of special importance in our development of statistical
methods. A square matrix is said to be symmetric if A = A′ or $a_{ij} = a_{ji}$ for all i
and j.




Example 2.7 (A symmetric matrix) The matrix 


4] 


is symmetric; the matrix 


is not symmetric. ■


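When two square matrices A and B are of the same dimension, both products
AB and BA are defined, although they need not be equal. (See Supplement 2A.)
If we let I denote the square matrix with ones on the diagonal and zeros elsewhere,
it follows from the definition of matrix multiplication that the (i, j)th entry of
AI is $a_{i1}\times 0 + \cdots + a_{i,j-1}\times 0 + a_{ij}\times 1 + a_{i,j+1}\times 0 + \cdots + a_{ik}\times 0 = a_{ij}$, so
AI = A. Similarly, IA = A, so

\[
\underset{(k\times k)}{\mathbf{I}}\ \underset{(k\times k)}{\mathbf{A}}
= \underset{(k\times k)}{\mathbf{A}}\ \underset{(k\times k)}{\mathbf{I}}
= \underset{(k\times k)}{\mathbf{A}}
\quad\text{for any}\ \underset{(k\times k)}{\mathbf{A}} \tag{2-11}
\]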


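The matrix I acts like 1 in ordinary multiplication ($1\cdot a = a\cdot 1 = a$), so it is
called the identity matrix.
The fundamental scalar relation about the existence of an inverse number $a^{-1}$
such that $a^{-1}a = aa^{-1} = 1$ if $a \neq 0$ has the following matrix algebra extension: If
there exists a matrix B such that

\[
\underset{(k\times k)}{\mathbf{B}}\ \underset{(k\times k)}{\mathbf{A}}
= \underset{(k\times k)}{\mathbf{A}}\ \underset{(k\times k)}{\mathbf{B}}
= \underset{(k\times k)}{\mathbf{I}}
\]

then B is called the inverse of A and is denoted by $\mathbf{A}^{-1}$.
The technical condition that an inverse exists is that the k columns $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k$
of A are linearly independent. That is, the existence of $\mathbf{A}^{-1}$ is equivalent to

\[
c_1\mathbf{a}_1 + c_2\mathbf{a}_2 + \cdots + c_k\mathbf{a}_k = \mathbf{0} \quad\text{only if}\quad c_1 = \cdots = c_k = 0 \tag{2-12}
\]

(See Result 2A.9 in Supplement 2A.)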


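Example 2.8 (The existence of a matrix inverse) For

\[
\mathbf{A} = \begin{bmatrix} 3 & 2 \\ 4 & 1 \end{bmatrix}
\]

you may verify that

\[
\begin{bmatrix} -.2 & .4 \\ .8 & -.6 \end{bmatrix}
\begin{bmatrix} 3 & 2 \\ 4 & 1 \end{bmatrix}
= \begin{bmatrix} (-.2)3 + (.4)4 & (-.2)2 + (.4)1 \\ (.8)3 + (-.6)4 & (.8)2 + (-.6)1 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\]

so

\[
\begin{bmatrix} -.2 & .4 \\ .8 & -.6 \end{bmatrix}
\]

is $\mathbf{A}^{-1}$. We note that

\[
c_1 \begin{bmatrix} 3 \\ 4 \end{bmatrix} + c_2 \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]

implies that $c_1 = c_2 = 0$, so the columns of A are linearly independent. This
confirms the condition stated in (2-12). ■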


A method for computing an inverse, when one exists, is given in Supplement 2A.
The routine, but lengthy, calculations are usually relegated to a computer, especially
when the dimension is greater than three. Even so, you must be forewarned that if
the column sum in (2-12) is nearly 0 for some constants $c_1, \ldots, c_k$, then the computer
may produce incorrect inverses due to extreme errors in rounding. It is always good
to check the products $\mathbf{A}\mathbf{A}^{-1}$ and $\mathbf{A}^{-1}\mathbf{A}$ for equality with I when $\mathbf{A}^{-1}$ is produced by a
computer package. (See Exercise 2.10.)
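The recommended check is easy to automate. The following sketch inverts the 2 × 2 matrix of Example 2.8 and then confirms that both products equal I to within a small tolerance. It relies on the standard closed-form inverse of a 2 × 2 matrix, which is our own choice for illustration and is not the general method of Supplement 2A; the function names and the tolerance `1e-9` are likewise assumptions.

```python
def inverse_2x2(a):
    """Inverse of a 2 x 2 matrix via the closed-form adjugate formula."""
    (p, q), (r, s) = a
    det = p * s - q * r
    if det == 0:
        # Zero determinant: the columns are linearly dependent, so (2-12) fails.
        raise ValueError("columns are linearly dependent; no inverse exists")
    return [[s / det, -q / det], [-r / det, p / det]]


def matmul(a, b):
    """Matrix product of nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def is_identity(m, tol=1e-9):
    """True if every entry of the square matrix m is within tol of the identity."""
    k = len(m)
    return all(abs(m[i][j] - (1.0 if i == j else 0.0)) <= tol
               for i in range(k) for j in range(k))


A = [[3, 2], [4, 1]]
A_inv = inverse_2x2(A)

# Check both products, as the text advises for computer-produced inverses.
check_passed = is_identity(matmul(A, A_inv)) and is_identity(matmul(A_inv, A))
```

A failed check on a nearly singular matrix is a signal that rounding error has contaminated the computed inverse.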

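Diagonal matrices have inverses that are easy to compute. For example,

\[
\begin{bmatrix}
a_{11} & 0 & 0 & 0 & 0 \\
0 & a_{22} & 0 & 0 & 0 \\
0 & 0 & a_{33} & 0 & 0 \\
0 & 0 & 0 & a_{44} & 0 \\
0 & 0 & 0 & 0 & a_{55}
\end{bmatrix}
\quad\text{has inverse}\quad
\begin{bmatrix}
\frac{1}{a_{11}} & 0 & 0 & 0 & 0 \\
0 & \frac{1}{a_{22}} & 0 & 0 & 0 \\
0 & 0 & \frac{1}{a_{33}} & 0 & 0 \\
0 & 0 & 0 & \frac{1}{a_{44}} & 0 \\
0 & 0 & 0 & 0 & \frac{1}{a_{55}}
\end{bmatrix}
\]

if all the $a_{ii} \neq 0$.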
Another special class of square matrices with which we shall become familiar
is the class of orthogonal matrices, characterized by

\[
\mathbf{Q}\mathbf{Q}' = \mathbf{Q}'\mathbf{Q} = \mathbf{I} \quad\text{or}\quad \mathbf{Q}' = \mathbf{Q}^{-1} \tag{2-13}
\]

The name derives from the property that if Q has ith row $\mathbf{q}_i'$, then QQ′ = I implies
that $\mathbf{q}_i'\mathbf{q}_i = 1$ and $\mathbf{q}_i'\mathbf{q}_j = 0$ for $i \neq j$, so the rows have unit length and are mutually
perpendicular (orthogonal). According to the condition Q′Q = I, the columns have
the same property.
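As a quick numerical illustration of (2-13), the sketch below builds a 2 × 2 rotation matrix and checks both QQ′ = I and Q′Q = I. The rotation matrix and the angle are hypothetical choices of ours, picked only because a rotation is a familiar orthogonal matrix; they do not come from the text.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def is_identity(m, tol=1e-12):
    k = len(m)
    return all(abs(m[i][j] - (1.0 if i == j else 0.0)) <= tol
               for i in range(k) for j in range(k))


theta = math.pi / 6  # hypothetical angle, chosen only for illustration
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

rows_orthonormal = is_identity(matmul(Q, transpose(Q)))     # QQ' = I
columns_orthonormal = is_identity(matmul(transpose(Q), Q))  # Q'Q = I
```

Because Q′ acts as the inverse, no separate inversion step is needed for an orthogonal matrix.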

We conclude our brief introduction to the elements of matrix algebra by introducing
a concept fundamental to multivariate statistical analysis. A square matrix A
is said to have an eigenvalue λ, with corresponding eigenvector $\mathbf{x} \neq \mathbf{0}$, if

\[
\mathbf{A}\mathbf{x} = \lambda\mathbf{x} \tag{2-14}
\]

Ordinarily, we normalize x so that it has length unity; that is, 1 = x′x. It is
convenient to denote normalized eigenvectors by e, and we do so in what follows.
Sparing you the details of the derivation (see [1]), we state the following basic result:


Let A be a k × k square symmetric matrix. Then A has k pairs of eigenvalues
and eigenvectors, namely,

\[
\lambda_1, \mathbf{e}_1 \quad \lambda_2, \mathbf{e}_2 \quad \ldots \quad \lambda_k, \mathbf{e}_k \tag{2-15}
\]

The eigenvectors can be chosen to satisfy $1 = \mathbf{e}_1'\mathbf{e}_1 = \cdots = \mathbf{e}_k'\mathbf{e}_k$ and be mutually
perpendicular. The eigenvectors are unique unless two or more eigenvalues
are equal.


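Example 2.9 (Verifying eigenvalues and eigenvectors) Let

\[
\mathbf{A} = \begin{bmatrix} 1 & -5 \\ -5 & 1 \end{bmatrix}
\]

Then, since

\[
\begin{bmatrix} 1 & -5 \\ -5 & 1 \end{bmatrix}
\begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix}
= 6 \begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix}
\]

$\lambda_1 = 6$ is an eigenvalue, and

\[
\mathbf{e}_1' = \left[ \frac{1}{\sqrt{2}},\ -\frac{1}{\sqrt{2}} \right]
\]

is its corresponding normalized eigenvector. You may wish to show that a second
eigenvalue-eigenvector pair is $\lambda_2 = -4$, $\mathbf{e}_2' = [1/\sqrt{2},\ 1/\sqrt{2}]$. ■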
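Verifying a proposed eigenvalue-eigenvector pair only requires checking (2-14) directly. The sketch below does this in Python. The matrix used is an assumption on our part: the symmetric matrix with rows [1, −5] and [−5, 1], which is consistent with the pair $\lambda_2 = -4$, $\mathbf{e}_2' = [1/\sqrt{2}, 1/\sqrt{2}]$ quoted in Example 2.9. The function names are ours.

```python
import math


def matvec(a, x):
    """Product of a matrix (nested lists) and a vector (list)."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def is_eigenpair(a, lam, x, tol=1e-9):
    """True if x is nonzero and a x = lam x holds within tol, as in (2-14)."""
    if all(abs(xi) <= tol for xi in x):
        return False  # the zero vector is excluded by definition
    return all(abs(lhs - lam * xi) <= tol for lhs, xi in zip(matvec(a, x), x))


A = [[1, -5], [-5, 1]]   # assumed matrix; see the lead-in above
r = 1 / math.sqrt(2)
e1, e2 = [r, -r], [r, r]

first_pair_ok = is_eigenpair(A, 6, e1)
second_pair_ok = is_eigenpair(A, -4, e2)
e2_has_unit_length = abs(sum(v * v for v in e2) - 1) < 1e-12
e1_e2_perpendicular = abs(sum(u * v for u, v in zip(e1, e2))) < 1e-12
```

The last two checks confirm that the eigenvectors are normalized and mutually perpendicular, as the basic result for symmetric matrices says they can be chosen to be.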


A method for calculating the λ's and e's is described in Supplement 2A. It is
instructive to do a few sample calculations to understand the technique. We usually rely
on a computer when the dimension of the square matrix is greater than two or three.


2.3 Positive Definite Matrices 


The study of the variation and interrelationships in multivariate data is often based 
upon distances and the assumption that the data are multivariate normally distributed. 
Squared distances (see Chapter 1) and the multivariate normal density can be 
expressed in terms of matrix products called quadratic forms (see Chapter 4). 
Consequently, it should not be surprising that quadratic forms play a central role in
multivariate analysis. In this section, we consider quadratic forms that are always
nonnegative and the associated positive definite matrices. 

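Results involving quadratic forms and symmetric matrices are, in many cases,
a direct consequence of an expansion for symmetric matrices known as the
spectral decomposition. The spectral decomposition of a k × k symmetric matrix
A is given by¹

\[
\underset{(k\times k)}{\mathbf{A}} = \lambda_1 \underset{(k\times 1)}{\mathbf{e}_1}\underset{(1\times k)}{\mathbf{e}_1'} + \lambda_2 \underset{(k\times 1)}{\mathbf{e}_2}\underset{(1\times k)}{\mathbf{e}_2'} + \cdots + \lambda_k \underset{(k\times 1)}{\mathbf{e}_k}\underset{(1\times k)}{\mathbf{e}_k'} \tag{2-16}
\]

where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are the eigenvalues of A and $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_k$ are the associated
normalized eigenvectors. (See also Result 2A.14 in Supplement 2A.) Thus, $\mathbf{e}_i'\mathbf{e}_i = 1$
for $i = 1, 2, \ldots, k$, and $\mathbf{e}_i'\mathbf{e}_j = 0$ for $i \neq j$.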


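Example 2.10 (The spectral decomposition of a matrix) Consider the symmetric matrix

\[
\mathbf{A} = \begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix}
\]

The eigenvalues obtained from the characteristic equation $|\mathbf{A} - \lambda\mathbf{I}| = 0$ are
$\lambda_1 = 9$, $\lambda_2 = 9$, and $\lambda_3 = 18$ (Definition 2A.30). The corresponding eigenvectors
$\mathbf{e}_1$, $\mathbf{e}_2$, and $\mathbf{e}_3$ are the (normalized) solutions of the equations $\mathbf{A}\mathbf{e}_i = \lambda_i\mathbf{e}_i$ for
$i = 1, 2, 3$. Thus, $\mathbf{A}\mathbf{e}_1 = \lambda_1\mathbf{e}_1$ gives

\[
\begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix}
\begin{bmatrix} e_{11} \\ e_{21} \\ e_{31} \end{bmatrix}
= 9 \begin{bmatrix} e_{11} \\ e_{21} \\ e_{31} \end{bmatrix}
\]

or

\[
\begin{aligned}
13e_{11} - 4e_{21} + 2e_{31} &= 9e_{11} \\
-4e_{11} + 13e_{21} - 2e_{31} &= 9e_{21} \\
2e_{11} - 2e_{21} + 10e_{31} &= 9e_{31}
\end{aligned}
\]

Moving the terms on the right of the equals sign to the left yields three homogeneous
equations in three unknowns, but two of the equations are redundant. Selecting one of
the equations and arbitrarily setting $e_{11} = 1$ and $e_{21} = 1$, we find that $e_{31} = 0$.
Consequently, the normalized eigenvector is

\[
\mathbf{e}_1' = \left[ \frac{1}{\sqrt{1^2 + 1^2 + 0^2}},\ \frac{1}{\sqrt{1^2 + 1^2 + 0^2}},\ \frac{0}{\sqrt{1^2 + 1^2 + 0^2}} \right] = \left[ \frac{1}{\sqrt{2}},\ \frac{1}{\sqrt{2}},\ 0 \right]
\]

since the sum of the squares of its elements
is unity. You may verify that $\mathbf{e}_2' = [1/\sqrt{18},\ -1/\sqrt{18},\ -4/\sqrt{18}]$ is also an eigenvector
for $9 = \lambda_2$, and $\mathbf{e}_3' = [2/3,\ -2/3,\ 1/3]$ is the normalized eigenvector corresponding
to the eigenvalue $\lambda_3 = 18$. Moreover, $\mathbf{e}_i'\mathbf{e}_j = 0$ for $i \neq j$.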


¹A proof of Equation (2-16) is beyond the scope of this book. The interested reader will find a proof
in [6], Chapter 8.


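The spectral decomposition of A is then

\[
\mathbf{A} = \lambda_1\mathbf{e}_1\mathbf{e}_1' + \lambda_2\mathbf{e}_2\mathbf{e}_2' + \lambda_3\mathbf{e}_3\mathbf{e}_3'
\]

or

\[
\begin{aligned}
\begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix}
&= 9 \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ 0 \end{bmatrix}
\begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \end{bmatrix}
+ 9 \begin{bmatrix} \frac{1}{\sqrt{18}} \\ \frac{-1}{\sqrt{18}} \\ \frac{-4}{\sqrt{18}} \end{bmatrix}
\begin{bmatrix} \frac{1}{\sqrt{18}} & \frac{-1}{\sqrt{18}} & \frac{-4}{\sqrt{18}} \end{bmatrix}
+ 18 \begin{bmatrix} \frac{2}{3} \\ -\frac{2}{3} \\ \frac{1}{3} \end{bmatrix}
\begin{bmatrix} \frac{2}{3} & -\frac{2}{3} & \frac{1}{3} \end{bmatrix} \\[6pt]
&= 9 \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & 0 & 0 \end{bmatrix}
+ 9 \begin{bmatrix} \frac{1}{18} & -\frac{1}{18} & -\frac{4}{18} \\ -\frac{1}{18} & \frac{1}{18} & \frac{4}{18} \\ -\frac{4}{18} & \frac{4}{18} & \frac{16}{18} \end{bmatrix}
+ 18 \begin{bmatrix} \frac{4}{9} & -\frac{4}{9} & \frac{2}{9} \\ -\frac{4}{9} & \frac{4}{9} & -\frac{2}{9} \\ \frac{2}{9} & -\frac{2}{9} & \frac{1}{9} \end{bmatrix}
\end{aligned}
\]

as you may readily verify. ■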
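The verification invited above can be carried out numerically. The sketch below rebuilds A from the three eigenvalue-eigenvector pairs of Example 2.10 by summing the terms $\lambda_i \mathbf{e}_i \mathbf{e}_i'$ of (2-16). The function names `outer` and `spectral_sum` are our own, and the comparison uses a tolerance because the eigenvectors involve square roots.

```python
import math


def outer(u, v):
    """Outer product u v' of two vectors, returned as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def spectral_sum(pairs):
    """Sum of lam * e e' over the (lam, e) pairs, as in (2-16)."""
    k = len(pairs[0][1])
    total = [[0.0] * k for _ in range(k)]
    for lam, e in pairs:
        ee = outer(e, e)
        for i in range(k):
            for j in range(k):
                total[i][j] += lam * ee[i][j]
    return total


s2, s18 = math.sqrt(2), math.sqrt(18)
pairs = [
    (9, [1 / s2, 1 / s2, 0.0]),
    (9, [1 / s18, -1 / s18, -4 / s18]),
    (18, [2 / 3, -2 / 3, 1 / 3]),
]

A = [[13, -4, 2], [-4, 13, -2], [2, -2, 10]]
rebuilt = spectral_sum(pairs)
max_error = max(abs(rebuilt[i][j] - A[i][j]) for i in range(3) for j in range(3))
```

A `max_error` at the level of floating-point rounding indicates that the decomposition reproduces A.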


The spectral decomposition is an important analytical tool. With it, we are very 
easily able to demonstrate certain statistical results. The first of these is a matrix 
explanation of distance, which we now develop. 

Because x′Ax has only squared terms $x_i^2$ and product terms $x_i x_k$, it is called a
quadratic form. When a k × k symmetric matrix A is such that

\[
0 \le \mathbf{x}'\mathbf{A}\mathbf{x} \tag{2-17}
\]

for all $\mathbf{x}' = [x_1, x_2, \ldots, x_k]$, both the matrix A and the quadratic form are said to be
nonnegative definite. If equality holds in (2-17) only for the vector $\mathbf{x}' = [0, 0, \ldots, 0]$,
then A or the quadratic form is said to be positive definite. In other words, A is
positive definite if

\[
0 < \mathbf{x}'\mathbf{A}\mathbf{x} \tag{2-18}
\]

for all vectors $\mathbf{x} \neq \mathbf{0}$.




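Example 2.11 (A positive definite matrix and quadratic form) Show that the matrix
for the following quadratic form is positive definite:

\[
3x_1^2 + 2x_2^2 - 2\sqrt{2}\, x_1 x_2
\]

To illustrate the general approach, we first write the quadratic form in matrix
notation as

\[
\begin{bmatrix} x_1 & x_2 \end{bmatrix}
\begin{bmatrix} 3 & -\sqrt{2} \\ -\sqrt{2} & 2 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \mathbf{x}'\mathbf{A}\mathbf{x}
\]

By Definition 2A.30, the eigenvalues of A are the solutions of the equation
$|\mathbf{A} - \lambda\mathbf{I}| = 0$, or $(3 - \lambda)(2 - \lambda) - 2 = 0$. The solutions are $\lambda_1 = 4$ and $\lambda_2 = 1$.
Using the spectral decomposition in (2-16), we can write

\[
\underset{(2\times 2)}{\mathbf{A}} = \lambda_1 \underset{(2\times 1)}{\mathbf{e}_1}\underset{(1\times 2)}{\mathbf{e}_1'} + \lambda_2 \underset{(2\times 1)}{\mathbf{e}_2}\underset{(1\times 2)}{\mathbf{e}_2'}
= 4\,\mathbf{e}_1\mathbf{e}_1' + \mathbf{e}_2\mathbf{e}_2'
\]

where $\mathbf{e}_1$ and $\mathbf{e}_2$ are the normalized and orthogonal eigenvectors associated with the
eigenvalues $\lambda_1 = 4$ and $\lambda_2 = 1$, respectively. Because 4 and 1 are scalars, premultiplication
and postmultiplication of A by x′ and x, respectively, where $\mathbf{x}' = [x_1, x_2]$ is
any nonzero vector, give

\[
\underset{(1\times 2)}{\mathbf{x}'}\,\underset{(2\times 2)}{\mathbf{A}}\,\underset{(2\times 1)}{\mathbf{x}}
= 4\,\mathbf{x}'\mathbf{e}_1\mathbf{e}_1'\mathbf{x} + \mathbf{x}'\mathbf{e}_2\mathbf{e}_2'\mathbf{x}
= 4y_1^2 + y_2^2 \ge 0
\]

with

\[
y_1 = \mathbf{x}'\mathbf{e}_1 = \mathbf{e}_1'\mathbf{x} \quad\text{and}\quad y_2 = \mathbf{x}'\mathbf{e}_2 = \mathbf{e}_2'\mathbf{x}
\]

We now show that $y_1$ and $y_2$ are not both zero and, consequently, that
$\mathbf{x}'\mathbf{A}\mathbf{x} = 4y_1^2 + y_2^2 > 0$, or A is positive definite.
From the definitions of $y_1$ and $y_2$, we have

\[
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} \mathbf{e}_1' \\ \mathbf{e}_2' \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\]

or

\[
\underset{(2\times 1)}{\mathbf{y}} = \underset{(2\times 2)}{\mathbf{E}}\ \underset{(2\times 1)}{\mathbf{x}}
\]

Now E is an orthogonal matrix and hence has inverse E′. Thus, x = E′y. But x is a
nonzero vector, and $\mathbf{0} \neq \mathbf{x} = \mathbf{E}'\mathbf{y}$ implies that $\mathbf{y} \neq \mathbf{0}$. ■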


Using the spectral decomposition, we can easily show that a k x k symmetric 
matrix A is a positive definite matrix if and only if every eigenvalue of A is positive. 
(See Exercise 2.17.) A is a nonnegative definite matrix if and only if all of its eigen- 
values are greater than or equal to zero. 
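The eigenvalue criterion gives a practical test for positive definiteness. For a 2 × 2 symmetric matrix the characteristic equation is a quadratic, so the eigenvalues have a closed form. The sketch below applies this to the matrix of Example 2.11; it is restricted to the 2 × 2 case, and the function names are our own assumptions rather than anything defined in the text.

```python
import math


def eigenvalues_sym_2x2(a):
    """Eigenvalues of a symmetric 2 x 2 matrix from |A - lam I| = 0, largest first."""
    a11, a12, a22 = a[0][0], a[0][1], a[1][1]
    mean = (a11 + a22) / 2
    radius = math.sqrt(((a11 - a22) / 2) ** 2 + a12 ** 2)
    return mean + radius, mean - radius


def is_positive_definite(a):
    """Positive definite if and only if every eigenvalue is positive."""
    return min(eigenvalues_sym_2x2(a)) > 0


def is_nonnegative_definite(a):
    """Nonnegative definite if and only if every eigenvalue is at least zero."""
    return min(eigenvalues_sym_2x2(a)) >= 0


def quadratic_form(a, x):
    """Evaluate x'Ax for a 2 x 2 matrix a and a vector x."""
    return sum(x[i] * a[i][j] * x[j] for i in range(2) for j in range(2))


A = [[3, -math.sqrt(2)], [-math.sqrt(2), 2]]   # matrix of Example 2.11
lam1, lam2 = eigenvalues_sym_2x2(A)
```

For comparison, the matrix with all four entries equal to 1 has eigenvalues 2 and 0, so it is nonnegative definite but not positive definite.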

Assume for the moment that the p elements $x_1, x_2, \ldots, x_p$ of a vector x are
realizations of p random variables $X_1, X_2, \ldots, X_p$. As we pointed out in Chapter 1,
we can regard these elements as the coordinates of a point in p-dimensional space,
and the “distance” of the point $[x_1, x_2, \ldots, x_p]'$ to the origin can, and in this case
should, be interpreted in terms of standard deviation units. In this way, we can
account for the inherent uncertainty (variability) in the observations. Points with the
same associated “uncertainty” are regarded as being at the same distance from
the origin.

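If we use the distance formula introduced in Chapter 1 [see Equation (1-22)],
the distance from the origin satisfies the general formula

\[
(\text{distance})^2 = a_{11}x_1^2 + a_{22}x_2^2 + \cdots + a_{pp}x_p^2
+ 2(a_{12}x_1x_2 + a_{13}x_1x_3 + \cdots + a_{p-1,p}x_{p-1}x_p)
\]

provided that $(\text{distance})^2 > 0$ for all $[x_1, x_2, \ldots, x_p] \neq [0, 0, \ldots, 0]$. Setting $a_{ij} = a_{ji}$,
$i \neq j$, $i = 1, 2, \ldots, p$, $j = 1, 2, \ldots, p$, we have

\[
0 < (\text{distance})^2 = \begin{bmatrix} x_1, x_2, \ldots, x_p \end{bmatrix}
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1p} \\
a_{21} & a_{22} & \cdots & a_{2p} \\
\vdots & \vdots & & \vdots \\
a_{p1} & a_{p2} & \cdots & a_{pp}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix}
\]

or

\[
0 < (\text{distance})^2 = \mathbf{x}'\mathbf{A}\mathbf{x} \quad\text{for}\ \mathbf{x} \neq \mathbf{0} \tag{2-19}
\]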


From (2-19), we see that the p X p symmetric matrix A is positive definite. In 
sum, distance is determined from a positive definite quadratic form x’Ax. Con- 
versely, a positive definite quadratic form can be interpreted as a squared distance. 


Comment. Let the square of the distance from the point $\mathbf{x}' = [x_1, x_2, \ldots, x_p]$
to the origin be given by x′Ax, where A is a p × p symmetric positive definite
matrix. Then the square of the distance from x to an arbitrary fixed point
$\boldsymbol{\mu}' = [\mu_1, \mu_2, \ldots, \mu_p]$ is given by the general expression $(\mathbf{x} - \boldsymbol{\mu})'\mathbf{A}(\mathbf{x} - \boldsymbol{\mu})$.


Expressing distance as the square root of a positive definite quadratic form allows
us to give a geometrical interpretation based on the eigenvalues and eigenvectors
of the matrix A. For example, suppose p = 2. Then the points $\mathbf{x}' = [x_1, x_2]$ of
constant distance c from the origin satisfy

\[
\mathbf{x}'\mathbf{A}\mathbf{x} = a_{11}x_1^2 + a_{22}x_2^2 + 2a_{12}x_1x_2 = c^2
\]

By the spectral decomposition, as in Example 2.11,

\[
\mathbf{A} = \lambda_1\mathbf{e}_1\mathbf{e}_1' + \lambda_2\mathbf{e}_2\mathbf{e}_2' \quad\text{so}\quad \mathbf{x}'\mathbf{A}\mathbf{x} = \lambda_1(\mathbf{x}'\mathbf{e}_1)^2 + \lambda_2(\mathbf{x}'\mathbf{e}_2)^2
\]

Now, $c^2 = \lambda_1 y_1^2 + \lambda_2 y_2^2$ is an ellipse in $y_1 = \mathbf{x}'\mathbf{e}_1$ and $y_2 = \mathbf{x}'\mathbf{e}_2$ because $\lambda_1, \lambda_2 > 0$
when A is positive definite. (See Exercise 2.17.) We easily verify that $\mathbf{x} = c\lambda_1^{-1/2}\mathbf{e}_1$
satisfies $\mathbf{x}'\mathbf{A}\mathbf{x} = \lambda_1(c\lambda_1^{-1/2}\mathbf{e}_1'\mathbf{e}_1)^2 = c^2$. Similarly, $\mathbf{x} = c\lambda_2^{-1/2}\mathbf{e}_2$ gives the appropriate
distance in the $\mathbf{e}_2$ direction. Thus, the points at distance c lie on an ellipse whose axes
are given by the eigenvectors of A with lengths proportional to the reciprocals of
the square roots of the eigenvalues. The constant of proportionality is c. The situation
is illustrated in Figure 2.6.


Figure 2.6 Points a constant distance c from the origin ($p = 2$, $1 \le \lambda_1 < \lambda_2$).


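If p > 2, the points $\mathbf{x}' = [x_1, x_2, \ldots, x_p]$ a constant distance $c = \sqrt{\mathbf{x}'\mathbf{A}\mathbf{x}}$ from
the origin lie on hyperellipsoids $c^2 = \lambda_1(\mathbf{x}'\mathbf{e}_1)^2 + \cdots + \lambda_p(\mathbf{x}'\mathbf{e}_p)^2$, whose axes are
given by the eigenvectors of A. The half-length in the direction $\mathbf{e}_i$ is equal to $c/\sqrt{\lambda_i}$,
$i = 1, 2, \ldots, p$, where $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the eigenvalues of A.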
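The half-length formula can be checked numerically. The sketch below uses the positive definite matrix of Example 2.11, whose eigenvalues are 4 and 1. The normalized eigenvectors used here were worked out by us from $(\mathbf{A} - \lambda\mathbf{I})\mathbf{e} = \mathbf{0}$ and are not given in the text, and the distance c = 2 is a hypothetical value. The code computes the half-lengths $c/\sqrt{\lambda_i}$, places a point at the end of each axis, and confirms that each such point lies at distance c.

```python
import math


def quadratic_form(a, x):
    """Evaluate x'Ax for a square matrix a and a vector x."""
    n = len(x)
    return sum(x[i] * a[i][j] * x[j] for i in range(n) for j in range(n))


A = [[3, -math.sqrt(2)], [-math.sqrt(2), 2]]

# (eigenvalue, normalized eigenvector) pairs, derived by us for this A.
eigenpairs = [
    (4.0, [math.sqrt(2 / 3), -math.sqrt(1 / 3)]),
    (1.0, [math.sqrt(1 / 3), math.sqrt(2 / 3)]),
]

c = 2.0  # hypothetical constant distance

# Half-length of the ellipse along each eigenvector direction.
half_lengths = [c / math.sqrt(lam) for lam, _ in eigenpairs]

# End point of each axis: x = (c / sqrt(lam)) * e.
endpoints = [[h * ei for ei in e] for h, (_, e) in zip(half_lengths, eigenpairs)]

# Each end point should sit at statistical distance c from the origin.
distances = [math.sqrt(quadratic_form(A, x)) for x in endpoints]
```

The larger eigenvalue produces the shorter axis, which is why the ellipse is narrowest in the direction of $\mathbf{e}_1$ here.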


2.4 A Square-Root Matrix 


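The spectral decomposition allows us to express the inverse of a square matrix in
terms of its eigenvalues and eigenvectors, and this leads to a useful square-root
matrix.

Let A be a k × k positive definite matrix with the spectral decomposition
$\mathbf{A} = \sum_{i=1}^{k} \lambda_i \mathbf{e}_i \mathbf{e}_i'$. Let the normalized eigenvectors be the columns of another matrix
$\mathbf{P} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_k]$. Then

\[
\underset{(k\times k)}{\mathbf{A}} = \sum_{i=1}^{k} \lambda_i \underset{(k\times 1)}{\mathbf{e}_i}\underset{(1\times k)}{\mathbf{e}_i'}
= \underset{(k\times k)}{\mathbf{P}}\ \underset{(k\times k)}{\boldsymbol{\Lambda}}\ \underset{(k\times k)}{\mathbf{P}'} \tag{2-20}
\]

where PP′ = P′P = I and Λ is the diagonal matrix

\[
\underset{(k\times k)}{\boldsymbol{\Lambda}} =
\begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_k
\end{bmatrix}
\quad\text{with}\ \lambda_i > 0
\]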


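Thus,

\[
\mathbf{A}^{-1} = \mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}' = \sum_{i=1}^{k} \frac{1}{\lambda_i}\, \mathbf{e}_i\mathbf{e}_i' \tag{2-21}
\]

since $(\mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}')\mathbf{P}\boldsymbol{\Lambda}\mathbf{P}' = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'(\mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}') = \mathbf{P}\mathbf{P}' = \mathbf{I}$.
Next, let $\boldsymbol{\Lambda}^{1/2}$ denote the diagonal matrix with $\sqrt{\lambda_i}$ as the ith diagonal element.
The matrix $\sum_{i=1}^{k} \sqrt{\lambda_i}\, \mathbf{e}_i\mathbf{e}_i' = \mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}'$ is called the square root of A and is denoted by
$\mathbf{A}^{1/2}$.

The square-root matrix, of a positive definite matrix A,

\[
\mathbf{A}^{1/2} = \sum_{i=1}^{k} \sqrt{\lambda_i}\, \mathbf{e}_i\mathbf{e}_i' = \mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}' \tag{2-22}
\]

has the following properties:
1. $(\mathbf{A}^{1/2})' = \mathbf{A}^{1/2}$ (that is, $\mathbf{A}^{1/2}$ is symmetric).
2. $\mathbf{A}^{1/2}\mathbf{A}^{1/2} = \mathbf{A}$.
3. $(\mathbf{A}^{1/2})^{-1} = \sum_{i=1}^{k} \frac{1}{\sqrt{\lambda_i}}\, \mathbf{e}_i\mathbf{e}_i' = \mathbf{P}\boldsymbol{\Lambda}^{-1/2}\mathbf{P}'$, where $\boldsymbol{\Lambda}^{-1/2}$ is a diagonal matrix with
$1/\sqrt{\lambda_i}$ as the ith diagonal element.
4. $\mathbf{A}^{1/2}\mathbf{A}^{-1/2} = \mathbf{A}^{-1/2}\mathbf{A}^{1/2} = \mathbf{I}$ and $\mathbf{A}^{-1/2}\mathbf{A}^{-1/2} = \mathbf{A}^{-1}$, where $\mathbf{A}^{-1/2} = (\mathbf{A}^{1/2})^{-1}$.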
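The four properties can be confirmed numerically for a small case. The sketch below builds $\mathbf{A}^{1/2}$, $\mathbf{A}^{-1/2}$, and $\mathbf{A}^{-1}$ from eigenvalue-eigenvector pairs by applying a function to each eigenvalue. The pairs are those of the matrix in Example 2.11 (eigenvalues 4 and 1), with eigenvectors worked out by us; the helper names are our own.

```python
import math


def weighted_outer_sum(pairs, f):
    """Return the sum of f(lam) * e e' over the (lam, e) eigenpairs."""
    k = len(pairs[0][1])
    return [[sum(f(lam) * e[i] * e[j] for lam, e in pairs) for j in range(k)]
            for i in range(k)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def close(a, b, tol=1e-9):
    """True if two matrices agree entrywise within tol."""
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Eigenpairs of A = [[3, -sqrt(2)], [-sqrt(2), 2]], derived by us.
pairs = [
    (4.0, [math.sqrt(2 / 3), -math.sqrt(1 / 3)]),
    (1.0, [math.sqrt(1 / 3), math.sqrt(2 / 3)]),
]

A = weighted_outer_sum(pairs, lambda lam: lam)                        # (2-20)
A_inv = weighted_outer_sum(pairs, lambda lam: 1 / lam)                # (2-21)
A_half = weighted_outer_sum(pairs, math.sqrt)                         # (2-22)
A_neg_half = weighted_outer_sum(pairs, lambda lam: 1 / math.sqrt(lam))
I = [[1.0, 0.0], [0.0, 1.0]]

property_2 = close(matmul(A_half, A_half), A)
property_4 = (close(matmul(A_half, A_neg_half), I)
              and close(matmul(A_neg_half, A_neg_half), A_inv))
```

Applying a function to the eigenvalues while keeping the eigenvectors fixed is the common thread in (2-20), (2-21), and (2-22).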


2.5 Random Vectors and Matrices 


A random vector is a vector whose elements are random variables. Similarly, a 
random matrix is a matrix whose elements are random variables. The expected value 
of a random matrix (or vector) is the matrix (vector) consisting of the expected 
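values of each of its elements. Specifically, let $\mathbf{X} = \{X_{ij}\}$ be an n × p random
matrix. Then the expected value of X, denoted by E(X), is the n × p matrix of
numbers (if they exist)

\[
E(\mathbf{X}) =
\begin{bmatrix}
E(X_{11}) & E(X_{12}) & \cdots & E(X_{1p}) \\
E(X_{21}) & E(X_{22}) & \cdots & E(X_{2p}) \\
\vdots & \vdots & \ddots & \vdots \\
E(X_{n1}) & E(X_{n2}) & \cdots & E(X_{np})
\end{bmatrix} \tag{2-23}
\]

where, for each element of the matrix,²

\[
E(X_{ij}) =
\begin{cases}
\displaystyle\int_{-\infty}^{\infty} x_{ij}\, f_{ij}(x_{ij})\, dx_{ij} & \text{if $X_{ij}$ is a continuous random variable with probability density function $f_{ij}(x_{ij})$} \\[10pt]
\displaystyle\sum_{\text{all } x_{ij}} x_{ij}\, p_{ij}(x_{ij}) & \text{if $X_{ij}$ is a discrete random variable with probability function $p_{ij}(x_{ij})$}
\end{cases}
\]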


Example 2.12 (Computing expected values for discrete random variables) Suppose
p = 2 and n = 1, and consider the random vector $\mathbf{X}' = [X_1, X_2]$. Let the discrete
random variable $X_1$ have the following probability function:

    x1        | -1    0    1
    p1(x1)    | .3   .3   .4

Then $E(X_1) = \sum_{\text{all } x_1} x_1 p_1(x_1) = (-1)(.3) + (0)(.3) + (1)(.4) = .1$.

Similarly, let the discrete random variable $X_2$ have the probability function

    x2        |  0    1
    p2(x2)    | .8   .2

Then $E(X_2) = \sum_{\text{all } x_2} x_2 p_2(x_2) = (0)(.8) + (1)(.2) = .2$.

Thus,

\[
E(\mathbf{X}) = \begin{bmatrix} E(X_1) \\ E(X_2) \end{bmatrix} = \begin{bmatrix} .1 \\ .2 \end{bmatrix}
\]
■
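The discrete case of the definition is a single weighted sum, which the following sketch makes explicit for the two probability functions of Example 2.12. Representing a probability function as a Python dictionary from values to probabilities is our own assumption, as is the function name `expected_value`.

```python
def expected_value(pmf):
    """E(X) for a discrete random variable given as {value: probability}."""
    if abs(sum(pmf.values()) - 1) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in pmf.items())


p1 = {-1: 0.3, 0: 0.3, 1: 0.4}   # probability function of X1
p2 = {0: 0.8, 1: 0.2}            # probability function of X2

# E(X) is the vector of the elementwise expected values.
mean_vector = [expected_value(p1), expected_value(p2)]
```

The resulting vector agrees with E(X) in the example up to floating-point rounding.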
Two results involving the expectation of sums and products of matrices follow
directly from the definition of the expected value of a random matrix and the univariate
properties of expectation, $E(X_1 + Y_1) = E(X_1) + E(Y_1)$ and $E(cX_1) = cE(X_1)$.
Let X and Y be random matrices of the same dimension, and let A and B be
conformable matrices of constants. Then (see Exercise 2.40)

\[
\begin{aligned}
E(\mathbf{X} + \mathbf{Y}) &= E(\mathbf{X}) + E(\mathbf{Y}) \\
E(\mathbf{A}\mathbf{X}\mathbf{B}) &= \mathbf{A}E(\mathbf{X})\mathbf{B}
\end{aligned} \tag{2-24}
\]

²If you are unfamiliar with calculus, you should concentrate on the interpretation of the expected
value and, eventually, variance. Our development is based primarily on the properties of expectation
rather than its particular evaluation for continuous or discrete random variables.


2.6 Mean Vectors and Covariance Matrices 


Suppose $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$ is a p × 1 random vector. Then each element of X is a
random variable with its own marginal probability distribution. (See Example 2.12.) The
marginal means $\mu_i$ and variances $\sigma_i^2$ are defined as $\mu_i = E(X_i)$ and $\sigma_i^2 = E(X_i - \mu_i)^2$,
$i = 1, 2, \ldots, p$, respectively. Specifically,

\[
\mu_i =
\begin{cases}
\displaystyle\int_{-\infty}^{\infty} x_i\, f_i(x_i)\, dx_i & \text{if $X_i$ is a continuous random variable with probability density function $f_i(x_i)$} \\[10pt]
\displaystyle\sum_{\text{all } x_i} x_i\, p_i(x_i) & \text{if $X_i$ is a discrete random variable with probability function $p_i(x_i)$}
\end{cases}
\]

\[
\sigma_i^2 =
\begin{cases}
\displaystyle\int_{-\infty}^{\infty} (x_i - \mu_i)^2 f_i(x_i)\, dx_i & \text{if $X_i$ is a continuous random variable with probability density function $f_i(x_i)$} \\[10pt]
\displaystyle\sum_{\text{all } x_i} (x_i - \mu_i)^2 p_i(x_i) & \text{if $X_i$ is a discrete random variable with probability function $p_i(x_i)$}
\end{cases} \tag{2-25}
\]

It will be convenient in later sections to denote the marginal variances by $\sigma_{ii}$ rather
than the more traditional $\sigma_i^2$, and consequently, we shall adopt this notation.

The behavior of any pair of random variables, such as $X_i$ and $X_k$, is described by
their joint probability function, and a measure of the linear association between
them is provided by the covariance

\[
\sigma_{ik} = E(X_i - \mu_i)(X_k - \mu_k) =
\begin{cases}
\displaystyle\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} (x_i - \mu_i)(x_k - \mu_k)\, f_{ik}(x_i, x_k)\, dx_i\, dx_k & \text{if $X_i$, $X_k$ are continuous random variables with the joint density function $f_{ik}(x_i, x_k)$} \\[10pt]
\displaystyle\sum_{\text{all } x_i}\ \sum_{\text{all } x_k} (x_i - \mu_i)(x_k - \mu_k)\, p_{ik}(x_i, x_k) & \text{if $X_i$, $X_k$ are discrete random variables with joint probability function $p_{ik}(x_i, x_k)$}
\end{cases} \tag{2-26}
\]

and $\mu_i$ and $\mu_k$, $i, k = 1, 2, \ldots, p$, are the marginal means. When i = k, the covariance
becomes the marginal variance.

More generally, the collective behavior of the p random variables $X_1, X_2, \ldots, X_p$
or, equivalently, the random vector $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$, is described by a joint
probability density function $f(x_1, x_2, \ldots, x_p) = f(\mathbf{x})$. As we have already noted in
this book, f(x) will often be the multivariate normal density function. (See Chapter 4.)

If the joint probability $P[X_i \le x_i \text{ and } X_k \le x_k]$ can be written as the product of
the corresponding marginal probabilities, so that

\[
P[X_i \le x_i \text{ and } X_k \le x_k] = P[X_i \le x_i]\, P[X_k \le x_k] \tag{2-27}
\]


for all pairs of values $x_i$, $x_k$, then $X_i$ and $X_k$ are said to be statistically independent.
When $X_i$ and $X_k$ are continuous random variables with joint density $f_{ik}(x_i, x_k)$ and
marginal densities $f_i(x_i)$ and $f_k(x_k)$, the independence condition becomes

\[
f_{ik}(x_i, x_k) = f_i(x_i)\, f_k(x_k)
\]

for all pairs $(x_i, x_k)$.
The p continuous random variables $X_1, X_2, \ldots, X_p$ are mutually statistically
independent if their joint density can be factored as

\[
f_{12\cdots p}(x_1, x_2, \ldots, x_p) = f_1(x_1)\, f_2(x_2) \cdots f_p(x_p) \tag{2-28}
\]

for all p-tuples $(x_1, x_2, \ldots, x_p)$.
Statistical independence has an important implication for covariance. The
factorization in (2-28) implies that $\text{Cov}(X_i, X_k) = 0$. Thus,

\[
\text{Cov}(X_i, X_k) = 0 \quad\text{if $X_i$ and $X_k$ are independent} \tag{2-29}
\]

The converse of (2-29) is not true in general; there are situations where
$\text{Cov}(X_i, X_k) = 0$, but $X_i$ and $X_k$ are not independent. (See [5].)

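The means and covariances of the p × 1 random vector X can be set out as
matrices. The expected value of each element is contained in the vector of means
$\boldsymbol{\mu} = E(\mathbf{X})$, and the p variances $\sigma_{ii}$ and the $p(p - 1)/2$ distinct covariances
$\sigma_{ik}$ $(i < k)$ are contained in the symmetric variance-covariance matrix
$\boldsymbol{\Sigma} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'$. Specifically,

\[
E(\mathbf{X}) = \begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_p) \end{bmatrix}
= \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix} = \boldsymbol{\mu} \tag{2-30}
\]

and

\[
\begin{aligned}
\boldsymbol{\Sigma} &= E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' \\
&= E\left( \begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \\ \vdots \\ X_p - \mu_p \end{bmatrix}
\begin{bmatrix} X_1 - \mu_1, & X_2 - \mu_2, & \ldots, & X_p - \mu_p \end{bmatrix} \right) \\
&= E \begin{bmatrix}
(X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) & \cdots & (X_1 - \mu_1)(X_p - \mu_p) \\
(X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 & \cdots & (X_2 - \mu_2)(X_p - \mu_p) \\
\vdots & \vdots & \ddots & \vdots \\
(X_p - \mu_p)(X_1 - \mu_1) & (X_p - \mu_p)(X_2 - \mu_2) & \cdots & (X_p - \mu_p)^2
\end{bmatrix} \\
&= \begin{bmatrix}
E(X_1 - \mu_1)^2 & E(X_1 - \mu_1)(X_2 - \mu_2) & \cdots & E(X_1 - \mu_1)(X_p - \mu_p) \\
E(X_2 - \mu_2)(X_1 - \mu_1) & E(X_2 - \mu_2)^2 & \cdots & E(X_2 - \mu_2)(X_p - \mu_p) \\
\vdots & \vdots & \ddots & \vdots \\
E(X_p - \mu_p)(X_1 - \mu_1) & E(X_p - \mu_p)(X_2 - \mu_2) & \cdots & E(X_p - \mu_p)^2
\end{bmatrix}
\end{aligned}
\]

or

\[
\boldsymbol{\Sigma} = \text{Cov}(\mathbf{X}) =
\begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{bmatrix} \tag{2-31}
\]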


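Example 2.13 (Computing the covariance matrix) Find the covariance matrix for
the two random variables $X_1$ and $X_2$ introduced in Example 2.12 when their joint
probability function $p_{12}(x_1, x_2)$ is represented by the entries in the body of the
following table:

                   x2
    x1          0      1     p1(x1)
    -1         .24    .06     .3
     0         .16    .14     .3
     1         .40    .00     .4
    p2(x2)     .8     .2      1

We have already shown that $\mu_1 = E(X_1) = .1$ and $\mu_2 = E(X_2) = .2$. (See Example 2.12.)
In addition,

\[
\begin{aligned}
\sigma_{11} &= E(X_1 - \mu_1)^2 = \sum_{\text{all } x_1} (x_1 - .1)^2 p_1(x_1) \\
&= (-1 - .1)^2(.3) + (0 - .1)^2(.3) + (1 - .1)^2(.4) = .69 \\[6pt]
\sigma_{22} &= E(X_2 - \mu_2)^2 = \sum_{\text{all } x_2} (x_2 - .2)^2 p_2(x_2) \\
&= (0 - .2)^2(.8) + (1 - .2)^2(.2) = .16 \\[6pt]
\sigma_{12} &= E(X_1 - \mu_1)(X_2 - \mu_2) = \sum_{\text{all pairs } (x_1, x_2)} (x_1 - .1)(x_2 - .2)\, p_{12}(x_1, x_2) \\
&= (-1 - .1)(0 - .2)(.24) + (-1 - .1)(1 - .2)(.06) + \cdots + (1 - .1)(1 - .2)(.00) = -.08 \\[6pt]
\sigma_{21} &= E(X_2 - \mu_2)(X_1 - \mu_1) = E(X_1 - \mu_1)(X_2 - \mu_2) = \sigma_{12} = -.08
\end{aligned}
\]

Consequently, with $\mathbf{X}' = [X_1, X_2]$,

\[
\boldsymbol{\mu} = E(\mathbf{X}) = \begin{bmatrix} E(X_1) \\ E(X_2) \end{bmatrix}
= \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} = \begin{bmatrix} .1 \\ .2 \end{bmatrix}
\]

and

\[
\begin{aligned}
\boldsymbol{\Sigma} &= E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' \\
&= E \begin{bmatrix} (X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) \\ (X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 \end{bmatrix} \\
&= \begin{bmatrix} E(X_1 - \mu_1)^2 & E(X_1 - \mu_1)(X_2 - \mu_2) \\ E(X_2 - \mu_2)(X_1 - \mu_1) & E(X_2 - \mu_2)^2 \end{bmatrix} \\
&= \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}
= \begin{bmatrix} .69 & -.08 \\ -.08 & .16 \end{bmatrix}
\end{aligned}
\]
■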
We note that the computation of means, variances, and covariances for discrete 
random variables involves summation (as in Examples 2.12 and 2.13), while analo- 
gous computations for continuous random variables involve integration. 
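The summations for the discrete case can be sketched as follows. The code computes the mean vector and covariance matrix of Example 2.13 directly from a joint probability function. The dictionary representation and function names are our own assumptions. Three of the joint probabilities (.24, .06, and .00) appear in the worked sum for $\sigma_{12}$; the remaining three entries used here are the values implied by the marginal probability functions of Example 2.12.

```python
# Joint probability function p12(x1, x2) as {(x1, x2): probability}.
joint = {
    (-1, 0): 0.24, (-1, 1): 0.06,
    (0, 0): 0.16, (0, 1): 0.14,
    (1, 0): 0.40, (1, 1): 0.00,
}


def mean_vector(joint):
    """Vector of marginal means, each a sum over all pairs of values."""
    return [sum(x[i] * p for x, p in joint.items()) for i in range(2)]


def covariance_matrix(joint):
    """Matrix with (i, k) entry E(X_i - mu_i)(X_k - mu_k), as in (2-26)."""
    mu = mean_vector(joint)
    return [[sum((x[i] - mu[i]) * (x[k] - mu[k]) * p for x, p in joint.items())
             for k in range(2)]
            for i in range(2)]


mu = mean_vector(joint)
Sigma = covariance_matrix(joint)
```

Because the (i, k) and (k, i) entries are computed from the same products, the resulting matrix is symmetric up to rounding.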


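Because $\sigma_{ik} = E(X_i - \mu_i)(X_k - \mu_k) = \sigma_{ki}$, it is convenient to write the
matrix appearing in (2-31) as

\[
\boldsymbol{\Sigma} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' =
\begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{12} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1p} & \sigma_{2p} & \cdots & \sigma_{pp}
\end{bmatrix} \tag{2-32}
\]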


We shall refer to $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ as the population mean (vector) and population
variance-covariance (matrix), respectively.

The multivariate normal distribution is completely specified once the mean
vector $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$ are given (see Chapter 4), so it is not
surprising that these quantities play an important role in many multivariate
procedures.

It is frequently informative to separate the information contained in variances
$\sigma_{ii}$ from that contained in measures of association and, in particular, the
measure of association known as the population correlation coefficient $\rho_{ik}$. The
correlation coefficient $\rho_{ik}$ is defined in terms of the covariance $\sigma_{ik}$ and variances
$\sigma_{ii}$ and $\sigma_{kk}$ as

\[
\rho_{ik} = \frac{\sigma_{ik}}{\sqrt{\sigma_{ii}}\,\sqrt{\sigma_{kk}}} \tag{2-33}
\]

The correlation coefficient measures the amount of linear association between the
random variables $X_i$ and $X_k$. (See, for example, [5].)


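Let the population correlation matrix be the p × p symmetric matrix

\[
\boldsymbol{\rho} =
\begin{bmatrix}
\dfrac{\sigma_{11}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{11}}} & \dfrac{\sigma_{12}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}} & \cdots & \dfrac{\sigma_{1p}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{pp}}} \\[10pt]
\dfrac{\sigma_{12}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}} & \dfrac{\sigma_{22}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{22}}} & \cdots & \dfrac{\sigma_{2p}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{pp}}} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\sigma_{1p}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{pp}}} & \dfrac{\sigma_{2p}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{pp}}} & \cdots & \dfrac{\sigma_{pp}}{\sqrt{\sigma_{pp}}\sqrt{\sigma_{pp}}}
\end{bmatrix}
=
\begin{bmatrix}
1 & \rho_{12} & \cdots & \rho_{1p} \\
\rho_{12} & 1 & \cdots & \rho_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{1p} & \rho_{2p} & \cdots & 1
\end{bmatrix} \tag{2-34}
\]

and let the p × p standard deviation matrix be

\[
\mathbf{V}^{1/2} =
\begin{bmatrix}
\sqrt{\sigma_{11}} & 0 & \cdots & 0 \\
0 & \sqrt{\sigma_{22}} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sqrt{\sigma_{pp}}
\end{bmatrix} \tag{2-35}
\]

Then it is easily verified (see Exercise 2.23) that

\[
\mathbf{V}^{1/2}\boldsymbol{\rho}\mathbf{V}^{1/2} = \boldsymbol{\Sigma} \tag{2-36}
\]

and

\[
\boldsymbol{\rho} = (\mathbf{V}^{1/2})^{-1}\boldsymbol{\Sigma}(\mathbf{V}^{1/2})^{-1} \tag{2-37}
\]

That is, $\boldsymbol{\Sigma}$ can be obtained from $\mathbf{V}^{1/2}$ and $\boldsymbol{\rho}$, whereas $\boldsymbol{\rho}$ can be obtained from $\boldsymbol{\Sigma}$.
Moreover, the expression of these relationships in terms of matrix operations allows
the calculations to be conveniently implemented on a computer.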


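Example 2.14 (Computing the correlation matrix from the covariance matrix)
Suppose

\[
\boldsymbol{\Sigma} = \begin{bmatrix} 4 & 1 & 2 \\ 1 & 9 & -3 \\ 2 & -3 & 25 \end{bmatrix}
= \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_{22} & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_{33} \end{bmatrix}
\]

Obtain $\mathbf{V}^{1/2}$ and $\boldsymbol{\rho}$.

Here

\[
\mathbf{V}^{1/2} = \begin{bmatrix} \sqrt{\sigma_{11}} & 0 & 0 \\ 0 & \sqrt{\sigma_{22}} & 0 \\ 0 & 0 & \sqrt{\sigma_{33}} \end{bmatrix}
= \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 5 \end{bmatrix}
\]

and

\[
(\mathbf{V}^{1/2})^{-1} = \begin{bmatrix} \frac{1}{2} & 0 & 0 \\ 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{5} \end{bmatrix}
\]

Consequently, from (2-37), the correlation matrix $\boldsymbol{\rho}$ is given by

\[
(\mathbf{V}^{1/2})^{-1}\boldsymbol{\Sigma}(\mathbf{V}^{1/2})^{-1}
= \begin{bmatrix} \frac{1}{2} & 0 & 0 \\ 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{5} \end{bmatrix}
\begin{bmatrix} 4 & 1 & 2 \\ 1 & 9 & -3 \\ 2 & -3 & 25 \end{bmatrix}
\begin{bmatrix} \frac{1}{2} & 0 & 0 \\ 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{5} \end{bmatrix}
= \begin{bmatrix} 1 & \frac{1}{6} & \frac{1}{5} \\ \frac{1}{6} & 1 & -\frac{1}{5} \\ \frac{1}{5} & -\frac{1}{5} & 1 \end{bmatrix}
\]
■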
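Because $\mathbf{V}^{1/2}$ is diagonal, relations (2-36) and (2-37) reduce to scaling each entry by two standard deviations, which is just (2-33) applied entry by entry. The sketch below uses that shortcut on the covariance matrix of Example 2.14; the function names are our own.

```python
import math


def correlation_matrix(sigma):
    """rho = (V^{1/2})^{-1} Sigma (V^{1/2})^{-1}, computed entrywise as in (2-33)."""
    p = len(sigma)
    sd = [math.sqrt(sigma[i][i]) for i in range(p)]   # diagonal of V^{1/2}
    return [[sigma[i][k] / (sd[i] * sd[k]) for k in range(p)] for i in range(p)]


def covariance_from_correlation(rho, sd):
    """Sigma = V^{1/2} rho V^{1/2}, as in (2-36), given the standard deviations."""
    p = len(rho)
    return [[sd[i] * rho[i][k] * sd[k] for k in range(p)] for i in range(p)]


Sigma = [[4, 1, 2], [1, 9, -3], [2, -3, 25]]
rho = correlation_matrix(Sigma)

# Going back from rho and the standard deviations recovers Sigma.
rebuilt = covariance_from_correlation(rho, [2.0, 3.0, 5.0])
```

The round trip illustrates the remark in the text that $\boldsymbol{\Sigma}$ can be obtained from $\mathbf{V}^{1/2}$ and $\boldsymbol{\rho}$, and $\boldsymbol{\rho}$ from $\boldsymbol{\Sigma}$.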


Partitioning the Covariance Matrix 


Often, the characteristics measured on individual trials will fall naturally into two 
or more groups. As examples, consider measurements of variables representing 
consumption and income or variables representing personality traits and physical 
characteristics. One approach to handling these situations is to let the characteristics
defining the distinct groups be subsets of the total collection of characteristics.
If the total collection is represented by a (p × 1)-dimensional random
vector X, the subsets can be regarded as components of X and can be sorted by 
partitioning X. 

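In general, we can partition the p characteristics contained in the p × 1 random
vector X into, for instance, two groups of size q and p − q, respectively. For example,
we can write

\[
\mathbf{X} = \left[ \begin{array}{c} X_1 \\ \vdots \\ X_q \\ \hline X_{q+1} \\ \vdots \\ X_p \end{array} \right]
= \begin{bmatrix} \mathbf{X}^{(1)} \\ \mathbf{X}^{(2)} \end{bmatrix}
\quad\text{and}\quad
\boldsymbol{\mu} = E(\mathbf{X}) = \left[ \begin{array}{c} \mu_1 \\ \vdots \\ \mu_q \\ \hline \mu_{q+1} \\ \vdots \\ \mu_p \end{array} \right]
= \begin{bmatrix} \boldsymbol{\mu}^{(1)} \\ \boldsymbol{\mu}^{(2)} \end{bmatrix} \tag{2-38}
\]

where $\mathbf{X}^{(1)}$ and $\boldsymbol{\mu}^{(1)}$ contain the first q components and $\mathbf{X}^{(2)}$ and $\boldsymbol{\mu}^{(2)}$ the remaining p − q.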

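From the definitions of the transpose and matrix multiplication,

\[
\begin{aligned}
(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})'
&= \begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \\ \vdots \\ X_q - \mu_q \end{bmatrix}
\begin{bmatrix} X_{q+1} - \mu_{q+1}, & X_{q+2} - \mu_{q+2}, & \ldots, & X_p - \mu_p \end{bmatrix} \\[6pt]
&= \begin{bmatrix}
(X_1 - \mu_1)(X_{q+1} - \mu_{q+1}) & (X_1 - \mu_1)(X_{q+2} - \mu_{q+2}) & \cdots & (X_1 - \mu_1)(X_p - \mu_p) \\
(X_2 - \mu_2)(X_{q+1} - \mu_{q+1}) & (X_2 - \mu_2)(X_{q+2} - \mu_{q+2}) & \cdots & (X_2 - \mu_2)(X_p - \mu_p) \\
\vdots & \vdots & \ddots & \vdots \\
(X_q - \mu_q)(X_{q+1} - \mu_{q+1}) & (X_q - \mu_q)(X_{q+2} - \mu_{q+2}) & \cdots & (X_q - \mu_q)(X_p - \mu_p)
\end{bmatrix}
\end{aligned}
\]

Upon taking the expectation of the matrix $(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})'$, we get

\[
E(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})' =
\begin{bmatrix}
\sigma_{1,q+1} & \sigma_{1,q+2} & \cdots & \sigma_{1p} \\
\sigma_{2,q+1} & \sigma_{2,q+2} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{q,q+1} & \sigma_{q,q+2} & \cdots & \sigma_{qp}
\end{bmatrix}
= \boldsymbol{\Sigma}_{12} \tag{2-39}
\]

which gives all the covariances, $\sigma_{ij}$, $i = 1, 2, \ldots, q$, $j = q + 1, q + 2, \ldots, p$, between
a component of $\mathbf{X}^{(1)}$ and a component of $\mathbf{X}^{(2)}$. Note that the matrix $\boldsymbol{\Sigma}_{12}$ is not
necessarily symmetric or even square.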

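Making use of the partitioning in Equation (2-38), we can easily demonstrate that

\[
(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' =
\begin{bmatrix}
\underset{(q\times 1)}{(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})}\underset{(1\times q)}{(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})'} &
\underset{(q\times 1)}{(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})}\underset{(1\times (p-q))}{(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})'} \\[10pt]
\underset{((p-q)\times 1)}{(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})}\underset{(1\times q)}{(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})'} &
\underset{((p-q)\times 1)}{(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})}\underset{(1\times (p-q))}{(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})'}
\end{bmatrix}
\]

and consequently,

\[
\underset{(p\times p)}{\boldsymbol{\Sigma}} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'
= \left[ \begin{array}{c|c} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \hline \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{array} \right]
= \left[ \begin{array}{ccc|ccc}
\sigma_{11} & \cdots & \sigma_{1q} & \sigma_{1,q+1} & \cdots & \sigma_{1p} \\
\vdots & & \vdots & \vdots & & \vdots \\
\sigma_{q1} & \cdots & \sigma_{qq} & \sigma_{q,q+1} & \cdots & \sigma_{qp} \\
\hline
\sigma_{q+1,1} & \cdots & \sigma_{q+1,q} & \sigma_{q+1,q+1} & \cdots & \sigma_{q+1,p} \\
\vdots & & \vdots & \vdots & & \vdots \\
\sigma_{p1} & \cdots & \sigma_{pq} & \sigma_{p,q+1} & \cdots & \sigma_{pp}
\end{array} \right] \tag{2-40}
\]

where $\boldsymbol{\Sigma}_{11}$ is q × q, $\boldsymbol{\Sigma}_{12}$ is q × (p − q), $\boldsymbol{\Sigma}_{21}$ is (p − q) × q, and $\boldsymbol{\Sigma}_{22}$ is (p − q) × (p − q).

Note that $\boldsymbol{\Sigma}_{12} = \boldsymbol{\Sigma}_{21}'$. The covariance matrix of $\mathbf{X}^{(1)}$ is $\boldsymbol{\Sigma}_{11}$, that of $\mathbf{X}^{(2)}$ is $\boldsymbol{\Sigma}_{22}$, and
that of elements from $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ is $\boldsymbol{\Sigma}_{12}$ (or $\boldsymbol{\Sigma}_{21}$).

It is sometimes convenient to use the $\text{Cov}(\mathbf{X}^{(1)}, \mathbf{X}^{(2)})$ notation where

\[
\text{Cov}(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}) = \boldsymbol{\Sigma}_{12}
\]

is a matrix containing all of the covariances between a component of $\mathbf{X}^{(1)}$ and a
component of $\mathbf{X}^{(2)}$.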
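In code, the partition in (2-40) amounts to slicing rows and columns at position q. The sketch below is a minimal illustration with nested lists; the function name `partition_covariance` is our own, and the covariance matrix is borrowed from Example 2.14 with the hypothetical choice q = 1.

```python
def partition_covariance(sigma, q):
    """Split a p x p covariance matrix into the blocks S11, S12, S21, S22 of (2-40)."""
    p = len(sigma)
    if not 0 < q < p:
        raise ValueError("q must satisfy 0 < q < p")
    s11 = [row[:q] for row in sigma[:q]]   # q x q
    s12 = [row[q:] for row in sigma[:q]]   # q x (p - q)
    s21 = [row[:q] for row in sigma[q:]]   # (p - q) x q
    s22 = [row[q:] for row in sigma[q:]]   # (p - q) x (p - q)
    return s11, s12, s21, s22


Sigma = [[4, 1, 2], [1, 9, -3], [2, -3, 25]]   # covariance matrix of Example 2.14
S11, S12, S21, S22 = partition_covariance(Sigma, 1)

# S12 should be the transpose of S21.
S21_transposed = [list(col) for col in zip(*S21)]
```

With q = 1 the block `S12` is 1 × 2, which shows concretely that the cross-covariance block need not be square.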


The Mean Vector and Covariance Matrix 
for Linear Combinations of Random Variables 


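Recall that if a single random variable, such as $X_1$, is multiplied by a constant c, then

\[
E(cX_1) = cE(X_1) = c\mu_1
\]

and

\[
\text{Var}(cX_1) = E(cX_1 - c\mu_1)^2 = c^2\,\text{Var}(X_1) = c^2\sigma_{11}
\]

If $X_2$ is a second random variable and a and b are constants, then, using additional
properties of expectation, we get

\[
\begin{aligned}
\text{Cov}(aX_1, bX_2) &= E(aX_1 - a\mu_1)(bX_2 - b\mu_2) \\
&= ab\,E(X_1 - \mu_1)(X_2 - \mu_2) \\
&= ab\,\text{Cov}(X_1, X_2) = ab\,\sigma_{12}
\end{aligned}
\]

Finally, for the linear combination $aX_1 + bX_2$, we have

\[
\begin{aligned}
E(aX_1 + bX_2) &= aE(X_1) + bE(X_2) = a\mu_1 + b\mu_2 \\
\text{Var}(aX_1 + bX_2) &= E[(aX_1 + bX_2) - (a\mu_1 + b\mu_2)]^2 \\
&= E[a(X_1 - \mu_1) + b(X_2 - \mu_2)]^2 \\
&= E[a^2(X_1 - \mu_1)^2 + b^2(X_2 - \mu_2)^2 + 2ab(X_1 - \mu_1)(X_2 - \mu_2)] \\
&= a^2\,\text{Var}(X_1) + b^2\,\text{Var}(X_2) + 2ab\,\text{Cov}(X_1, X_2) \\
&= a^2\sigma_{11} + b^2\sigma_{22} + 2ab\,\sigma_{12}
\end{aligned} \tag{2-41}
\]

With $\mathbf{c}' = [a, b]$, $aX_1 + bX_2$ can be written as

\[
\begin{bmatrix} a & b \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \mathbf{c}'\mathbf{X}
\]

Similarly, $E(aX_1 + bX_2) = a\mu_1 + b\mu_2$ can be expressed as

\[
\begin{bmatrix} a & b \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} = \mathbf{c}'\boldsymbol{\mu}
\]

If we let

\[
\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}
\]

be the variance-covariance matrix of X, Equation (2-41) becomes

\[
\text{Var}(aX_1 + bX_2) = \text{Var}(\mathbf{c}'\mathbf{X}) = \mathbf{c}'\boldsymbol{\Sigma}\mathbf{c} \tag{2-42}
\]

since

\[
\mathbf{c}'\boldsymbol{\Sigma}\mathbf{c} = \begin{bmatrix} a & b \end{bmatrix}
\begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}
\begin{bmatrix} a \\ b \end{bmatrix}
= a^2\sigma_{11} + 2ab\,\sigma_{12} + b^2\sigma_{22}
\]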


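The preceding results can be extended to a linear combination of p random variables:

The linear combination $\mathbf{c}'\mathbf{X} = c_1X_1 + \cdots + c_pX_p$ has

\[
\begin{aligned}
\text{mean} &= E(\mathbf{c}'\mathbf{X}) = \mathbf{c}'\boldsymbol{\mu} \\
\text{variance} &= \text{Var}(\mathbf{c}'\mathbf{X}) = \mathbf{c}'\boldsymbol{\Sigma}\mathbf{c}
\end{aligned} \tag{2-43}
\]

where $\boldsymbol{\mu} = E(\mathbf{X})$ and $\boldsymbol{\Sigma} = \text{Cov}(\mathbf{X})$.

In general, consider the q linear combinations of the p random variables
$X_1, \ldots, X_p$:

\[
\begin{aligned}
Z_1 &= c_{11}X_1 + c_{12}X_2 + \cdots + c_{1p}X_p \\
Z_2 &= c_{21}X_1 + c_{22}X_2 + \cdots + c_{2p}X_p \\
&\ \ \vdots \\
Z_q &= c_{q1}X_1 + c_{q2}X_2 + \cdots + c_{qp}X_p
\end{aligned}
\]

or

\[
\underset{(q\times 1)}{\mathbf{Z}} = \begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_q \end{bmatrix}
= \begin{bmatrix}
c_{11} & c_{12} & \cdots & c_{1p} \\
c_{21} & c_{22} & \cdots & c_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
c_{q1} & c_{q2} & \cdots & c_{qp}
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}
= \underset{(q\times p)}{\mathbf{C}}\ \underset{(p\times 1)}{\mathbf{X}} \tag{2-44}
\]

The linear combinations Z = CX have

\[
\begin{aligned}
\boldsymbol{\mu}_{\mathbf{Z}} &= E(\mathbf{Z}) = E(\mathbf{C}\mathbf{X}) = \mathbf{C}\boldsymbol{\mu}_{\mathbf{X}} \\
\boldsymbol{\Sigma}_{\mathbf{Z}} &= \text{Cov}(\mathbf{Z}) = \text{Cov}(\mathbf{C}\mathbf{X}) = \mathbf{C}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{C}'
\end{aligned} \tag{2-45}
\]

where $\boldsymbol{\mu}_{\mathbf{X}}$ and $\boldsymbol{\Sigma}_{\mathbf{X}}$ are the mean vector and variance-covariance matrix of X, respectively.
(See Exercise 2.28 for the computation of the off-diagonal terms in $\mathbf{C}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{C}'$.)

We shall rely heavily on the result in (2-45) in our discussions of principal components
and factor analysis in Chapters 8 and 9.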


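Example 2.15 (Means and covariances of linear combinations) Let $\mathbf{X}' = [X_1, X_2]$
be a random vector with mean vector $\boldsymbol{\mu}_{\mathbf{X}}' = [\mu_1, \mu_2]$ and variance-covariance matrix

\[
\boldsymbol{\Sigma}_{\mathbf{X}} = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}
\]

Find the mean vector and covariance matrix for the linear combinations

\[
\begin{aligned}
Z_1 &= X_1 - X_2 \\
Z_2 &= X_1 + X_2
\end{aligned}
\]

or

\[
\mathbf{Z} = \begin{bmatrix} Z_1 \\ Z_2 \end{bmatrix}
= \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \mathbf{C}\mathbf{X}
\]

in terms of $\boldsymbol{\mu}_{\mathbf{X}}$ and $\boldsymbol{\Sigma}_{\mathbf{X}}$.

Here

\[
\boldsymbol{\mu}_{\mathbf{Z}} = E(\mathbf{Z}) = \mathbf{C}\boldsymbol{\mu}_{\mathbf{X}}
= \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}
= \begin{bmatrix} \mu_1 - \mu_2 \\ \mu_1 + \mu_2 \end{bmatrix}
\]

and

\[
\begin{aligned}
\boldsymbol{\Sigma}_{\mathbf{Z}} = \text{Cov}(\mathbf{Z}) = \mathbf{C}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{C}'
&= \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}
\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} \\[6pt]
&= \begin{bmatrix} \sigma_{11} - 2\sigma_{12} + \sigma_{22} & \sigma_{11} - \sigma_{22} \\ \sigma_{11} - \sigma_{22} & \sigma_{11} + 2\sigma_{12} + \sigma_{22} \end{bmatrix}
\end{aligned}
\]

Note that if $\sigma_{11} = \sigma_{22}$, that is, if $X_1$ and $X_2$ have equal variances, the off-diagonal
terms in $\boldsymbol{\Sigma}_{\mathbf{Z}}$ vanish. This demonstrates the well-known result that the sum and difference
of two random variables with identical variances are uncorrelated. ■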
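Result (2-45) is a pair of matrix products, so it is straightforward to evaluate numerically. The sketch below applies it to the sum and difference of Example 2.15. The numerical mean vector and covariance matrix are hypothetical values of ours, chosen with equal variances so that the off-diagonal terms should vanish; the function names are also our own.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def linear_combination_moments(C, mu, Sigma):
    """Mean vector C mu and covariance matrix C Sigma C' of Z = C X, as in (2-45)."""
    mean = [row[0] for row in matmul(C, [[m] for m in mu])]
    cov = matmul(matmul(C, Sigma), transpose(C))
    return mean, cov


C = [[1, -1], [1, 1]]              # Z1 = X1 - X2, Z2 = X1 + X2
mu = [1.0, 2.0]                    # hypothetical mean vector
Sigma = [[4.0, 1.0], [1.0, 4.0]]   # hypothetical covariance matrix, equal variances

mu_Z, Sigma_Z = linear_combination_moments(C, mu, Sigma)
```

With these inputs the variances of the difference and the sum are $\sigma_{11} - 2\sigma_{12} + \sigma_{22} = 6$ and $\sigma_{11} + 2\sigma_{12} + \sigma_{22} = 10$, and the covariance between them is $\sigma_{11} - \sigma_{22} = 0$.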


Partitioning the Sample Mean Vector 
and Covariance Matrix 


Many of the matrix results in this section have been expressed in terms of population 
means and variances (covariances). The results in (2-36), (2-37), (2-38), and (2-40) 
also hold if the population quantities are replaced by their appropriately defined 
sample counterparts. 

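Let $\bar{\mathbf{x}}' = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p]$ be the vector of sample averages constructed from
n observations on p variables $X_1, X_2, \ldots, X_p$, and let

\[
\mathbf{S}_n =
\begin{bmatrix}
s_{11} & \cdots & s_{1p} \\
\vdots & \ddots & \vdots \\
s_{1p} & \cdots & s_{pp}
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{1}{n}\displaystyle\sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2 & \cdots & \dfrac{1}{n}\displaystyle\sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{jp} - \bar{x}_p) \\
\vdots & \ddots & \vdots \\
\dfrac{1}{n}\displaystyle\sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{jp} - \bar{x}_p) & \cdots & \dfrac{1}{n}\displaystyle\sum_{j=1}^{n} (x_{jp} - \bar{x}_p)^2
\end{bmatrix}
\]

be the corresponding sample variance-covariance matrix.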


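The sample mean vector and the covariance matrix can be partitioned in order
to distinguish quantities corresponding to groups of variables. Thus,

\[
\underset{(p\times 1)}{\bar{\mathbf{x}}} = \left[ \begin{array}{c} \bar{x}_1 \\ \vdots \\ \bar{x}_q \\ \hline \bar{x}_{q+1} \\ \vdots \\ \bar{x}_p \end{array} \right]
= \begin{bmatrix} \bar{\mathbf{x}}^{(1)} \\ \bar{\mathbf{x}}^{(2)} \end{bmatrix} \tag{2-46}
\]

and

\[
\underset{(p\times p)}{\mathbf{S}_n} = \left[ \begin{array}{ccc|ccc}
s_{11} & \cdots & s_{1q} & s_{1,q+1} & \cdots & s_{1p} \\
\vdots & & \vdots & \vdots & & \vdots \\
s_{q1} & \cdots & s_{qq} & s_{q,q+1} & \cdots & s_{qp} \\
\hline
s_{q+1,1} & \cdots & s_{q+1,q} & s_{q+1,q+1} & \cdots & s_{q+1,p} \\
\vdots & & \vdots & \vdots & & \vdots \\
s_{p1} & \cdots & s_{pq} & s_{p,q+1} & \cdots & s_{pp}
\end{array} \right]
= \left[ \begin{array}{c|c} \mathbf{S}_{11} & \mathbf{S}_{12} \\ \hline \mathbf{S}_{21} & \mathbf{S}_{22} \end{array} \right] \tag{2-47}
\]

where the first block row and column have dimension q and the second have dimension p − q;
$\bar{\mathbf{x}}^{(1)}$ and $\bar{\mathbf{x}}^{(2)}$ are the sample mean vectors constructed from observations
$\mathbf{x}^{(1)} = [x_1, \ldots, x_q]'$ and $\mathbf{x}^{(2)} = [x_{q+1}, \ldots, x_p]'$, respectively; $\mathbf{S}_{11}$ is the sample covariance
matrix computed from observations $\mathbf{x}^{(1)}$; $\mathbf{S}_{22}$ is the sample covariance
matrix computed from observations $\mathbf{x}^{(2)}$; and $\mathbf{S}_{12} = \mathbf{S}_{21}'$ is the sample covariance
matrix for elements of $\mathbf{x}^{(1)}$ and elements of $\mathbf{x}^{(2)}$.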


2.7 Matrix Inequalities and Maximization


Maximization principles play an important role in several multivariate techniques. 
Linear discriminant analysis, for example, is concerned with allocating observations 
to predetermined groups. The allocation rule is often a linear function of measure- 
ments that maximizes the separation between groups relative to their within-group 
variability. As another example, principal components are linear combinations of
measurements with maximum variability. 

The matrix inequalities presented in this section will easily allow us to derive 
certain maximization results, which will be referenced in later chapters. 


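Cauchy-Schwarz Inequality. Let $\mathbf{b}$ and $\mathbf{d}$ be any two $p \times 1$ vectors. Then
\[
(\mathbf{b}'\mathbf{d})^2 \le (\mathbf{b}'\mathbf{b})(\mathbf{d}'\mathbf{d}) \tag{2-48}
\]
with equality if and only if $\mathbf{b} = c\,\mathbf{d}$ (or $\mathbf{d} = c\,\mathbf{b}$) for some constant $c$.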




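Proof. The inequality is obvious if either $\mathbf{b} = \mathbf{0}$ or $\mathbf{d} = \mathbf{0}$. Excluding this possibility,
consider the vector $\mathbf{b} - x\mathbf{d}$, where $x$ is an arbitrary scalar. Since the length of
$\mathbf{b} - x\mathbf{d}$ is positive for $\mathbf{b} - x\mathbf{d} \ne \mathbf{0}$, in this case
\[
0 < (\mathbf{b} - x\mathbf{d})'(\mathbf{b} - x\mathbf{d})
= \mathbf{b}'\mathbf{b} - x\mathbf{d}'\mathbf{b} - \mathbf{b}'(x\mathbf{d}) + x^2\mathbf{d}'\mathbf{d}
= \mathbf{b}'\mathbf{b} - 2x(\mathbf{b}'\mathbf{d}) + x^2(\mathbf{d}'\mathbf{d})
\]
The last expression is quadratic in $x$. If we complete the square by adding and
subtracting the scalar $(\mathbf{b}'\mathbf{d})^2/\mathbf{d}'\mathbf{d}$, we get
\[
0 < \mathbf{b}'\mathbf{b} - \frac{(\mathbf{b}'\mathbf{d})^2}{\mathbf{d}'\mathbf{d}} + \frac{(\mathbf{b}'\mathbf{d})^2}{\mathbf{d}'\mathbf{d}} - 2x(\mathbf{b}'\mathbf{d}) + x^2(\mathbf{d}'\mathbf{d})
= \mathbf{b}'\mathbf{b} - \frac{(\mathbf{b}'\mathbf{d})^2}{\mathbf{d}'\mathbf{d}} + (\mathbf{d}'\mathbf{d})\left(x - \frac{\mathbf{b}'\mathbf{d}}{\mathbf{d}'\mathbf{d}}\right)^2
\]
The term in parentheses is zero if we choose $x = \mathbf{b}'\mathbf{d}/\mathbf{d}'\mathbf{d}$, so we conclude that
\[
0 < \mathbf{b}'\mathbf{b} - \frac{(\mathbf{b}'\mathbf{d})^2}{\mathbf{d}'\mathbf{d}}
\]
or $(\mathbf{b}'\mathbf{d})^2 < (\mathbf{b}'\mathbf{b})(\mathbf{d}'\mathbf{d})$ if $\mathbf{b} \ne x\mathbf{d}$ for some $x$.

Note that if $\mathbf{b} = c\,\mathbf{d}$, $0 = (\mathbf{b} - c\,\mathbf{d})'(\mathbf{b} - c\,\mathbf{d})$, and the same argument produces
$(\mathbf{b}'\mathbf{d})^2 = (\mathbf{b}'\mathbf{b})(\mathbf{d}'\mathbf{d})$. ■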
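As a quick numerical illustration of the inequality and its equality case, the sketch below evaluates the gap (b'b)(d'd) - (b'd)^2 for two hypothetical integer vectors. The function names `dot` and `cs_gap` are illustrative assumptions, not notation from the text.

```python
def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def cs_gap(b, d):
    """Return (b'b)(d'd) - (b'd)^2, which is never negative."""
    return dot(b, b) * dot(d, d) - dot(b, d) ** 2

b = [1, 2, -2]      # hypothetical vector
d = [3, -1, 4]      # hypothetical vector, not a multiple of b

gap_general = cs_gap(b, d)              # strictly positive
gap_multiple = cs_gap([2, 4, -4], b)    # first argument equals 2b, so the gap is zero
print(gap_general, gap_multiple)
```

The gap vanishes only when one vector is a scalar multiple of the other, which is the equality condition stated with (2-48).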


A simple, but important, extension of the Cauchy—Schwarz inequality follows 
directly. 


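Extended Cauchy-Schwarz Inequality. Let $\underset{(p \times 1)}{\mathbf{b}}$ and $\underset{(p \times 1)}{\mathbf{d}}$ be any two vectors, and
let $\underset{(p \times p)}{\mathbf{B}}$ be a positive definite matrix. Then
\[
(\mathbf{b}'\mathbf{d})^2 \le (\mathbf{b}'\mathbf{B}\mathbf{b})(\mathbf{d}'\mathbf{B}^{-1}\mathbf{d}) \tag{2-49}
\]
with equality if and only if $\mathbf{b} = c\,\mathbf{B}^{-1}\mathbf{d}$ (or $\mathbf{d} = c\,\mathbf{B}\mathbf{b}$) for some constant $c$.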
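Proof. The inequality is obvious when $\mathbf{b} = \mathbf{0}$ or $\mathbf{d} = \mathbf{0}$. For cases other than these,
consider the square-root matrix $\mathbf{B}^{1/2}$ defined in terms of its eigenvalues $\lambda_i$ and
the normalized eigenvectors $\mathbf{e}_i$ as $\mathbf{B}^{1/2} = \sum_{i=1}^{p} \sqrt{\lambda_i}\,\mathbf{e}_i\mathbf{e}_i'$. If we set [see also (2-22)]
\[
\mathbf{B}^{-1/2} = \sum_{i=1}^{p} \frac{1}{\sqrt{\lambda_i}}\,\mathbf{e}_i\mathbf{e}_i'
\]
it follows that
\[
\mathbf{b}'\mathbf{d} = \mathbf{b}'\mathbf{I}\mathbf{d} = \mathbf{b}'\mathbf{B}^{1/2}\mathbf{B}^{-1/2}\mathbf{d} = (\mathbf{B}^{1/2}\mathbf{b})'(\mathbf{B}^{-1/2}\mathbf{d})
\]
and the proof is completed by applying the Cauchy-Schwarz inequality to the
vectors $(\mathbf{B}^{1/2}\mathbf{b})$ and $(\mathbf{B}^{-1/2}\mathbf{d})$. ■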


The extended Cauchy—Schwarz inequality gives rise to the following maximiza- 
tion result. 




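Maximization Lemma. Let $\underset{(p \times p)}{\mathbf{B}}$ be positive definite and $\underset{(p \times 1)}{\mathbf{d}}$ be a given vector.
Then, for an arbitrary nonzero vector $\underset{(p \times 1)}{\mathbf{x}}$,
\[
\max_{\mathbf{x} \ne \mathbf{0}} \frac{(\mathbf{x}'\mathbf{d})^2}{\mathbf{x}'\mathbf{B}\mathbf{x}} = \mathbf{d}'\mathbf{B}^{-1}\mathbf{d} \tag{2-50}
\]
with the maximum attained when $\underset{(p \times 1)}{\mathbf{x}} = c\,\underset{(p \times p)}{\mathbf{B}^{-1}}\,\underset{(p \times 1)}{\mathbf{d}}$ for any constant $c \ne 0$.

Proof. By the extended Cauchy-Schwarz inequality, $(\mathbf{x}'\mathbf{d})^2 \le (\mathbf{x}'\mathbf{B}\mathbf{x})(\mathbf{d}'\mathbf{B}^{-1}\mathbf{d})$.
Because $\mathbf{x} \ne \mathbf{0}$ and $\mathbf{B}$ is positive definite, $\mathbf{x}'\mathbf{B}\mathbf{x} > 0$. Dividing both sides of the
inequality by the positive scalar $\mathbf{x}'\mathbf{B}\mathbf{x}$ yields the upper bound
\[
\frac{(\mathbf{x}'\mathbf{d})^2}{\mathbf{x}'\mathbf{B}\mathbf{x}} \le \mathbf{d}'\mathbf{B}^{-1}\mathbf{d}
\]
Taking the maximum over $\mathbf{x}$ gives Equation (2-50) because the bound is attained for
$\mathbf{x} = c\,\mathbf{B}^{-1}\mathbf{d}$. ■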
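The lemma can be illustrated with exact rational arithmetic. This is a minimal sketch under stated assumptions: the 2 x 2 positive definite matrix `B`, the vector `d`, and the integer search grid are all hypothetical choices made for illustration.

```python
from fractions import Fraction

B = [[2, 1], [1, 3]]     # hypothetical positive definite matrix
d = [1, 2]               # hypothetical given vector

def ratio(x):
    """(x'd)^2 / (x'Bx) for a nonzero integer 2-vector x."""
    xd = x[0] * d[0] + x[1] * d[1]
    xBx = sum(x[i] * B[i][j] * x[j] for i in range(2) for j in range(2))
    return Fraction(xd * xd, xBx)

# B^{-1} d from the 2 x 2 inverse formula, then the bound d'B^{-1}d.
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
B_inv_d = [Fraction(B[1][1] * d[0] - B[0][1] * d[1], det),
           Fraction(-B[1][0] * d[0] + B[0][0] * d[1], det)]
bound = d[0] * B_inv_d[0] + d[1] * B_inv_d[1]

# Search a small integer grid; (1, 3) is a multiple of B^{-1} d = (1/5, 3/5).
grid = [(i, j) for i in range(-6, 7) for j in range(-6, 7) if (i, j) != (0, 0)]
largest = max(ratio(x) for x in grid)
print(bound, largest, ratio((1, 3)))
```

On this grid no ratio exceeds the bound, and the bound is reached at a multiple of B^{-1}d, as (2-50) asserts.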


A final maximization result will provide us with an interpretation of eigenvalues. 


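Maximization of Quadratic Forms for Points on the Unit Sphere. Let $\underset{(p \times p)}{\mathbf{B}}$ be a
positive definite matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ and associated
normalized eigenvectors $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$. Then
\[
\begin{aligned}
\max_{\mathbf{x} \ne \mathbf{0}} \frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}} &= \lambda_1 \quad (\text{attained when } \mathbf{x} = \mathbf{e}_1) \\
\min_{\mathbf{x} \ne \mathbf{0}} \frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}} &= \lambda_p \quad (\text{attained when } \mathbf{x} = \mathbf{e}_p)
\end{aligned}
\tag{2-51}
\]
Moreover,
\[
\max_{\mathbf{x} \perp \mathbf{e}_1, \ldots, \mathbf{e}_k} \frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}} = \lambda_{k+1} \quad (\text{attained when } \mathbf{x} = \mathbf{e}_{k+1},\ k = 1, 2, \ldots, p - 1)
\tag{2-52}
\]
where the symbol $\perp$ is read "is perpendicular to."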


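Proof. Let $\underset{(p \times p)}{\mathbf{P}}$ be the orthogonal matrix whose columns are the eigenvectors
$\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$ and $\boldsymbol{\Lambda}$ be the diagonal matrix with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ along the
main diagonal. Let $\mathbf{B}^{1/2} = \mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}'$ [see (2-22)] and $\underset{(p \times 1)}{\mathbf{y}} = \underset{(p \times p)}{\mathbf{P}'}\,\underset{(p \times 1)}{\mathbf{x}}$.
Consequently, $\mathbf{x} \ne \mathbf{0}$ implies $\mathbf{y} \ne \mathbf{0}$. Thus,
\[
\frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}}
= \frac{\mathbf{x}'\mathbf{B}^{1/2}\mathbf{B}^{1/2}\mathbf{x}}{\mathbf{x}'\mathbf{P}\mathbf{P}'\mathbf{x}}
= \frac{\mathbf{x}'\mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}'\mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}'\mathbf{x}}{\mathbf{y}'\mathbf{y}}
= \frac{\mathbf{y}'\boldsymbol{\Lambda}\mathbf{y}}{\mathbf{y}'\mathbf{y}}
= \frac{\sum_{i=1}^{p} \lambda_i y_i^2}{\sum_{i=1}^{p} y_i^2}
\le \lambda_1 \frac{\sum_{i=1}^{p} y_i^2}{\sum_{i=1}^{p} y_i^2} = \lambda_1
\tag{2-53}
\]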




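Setting $\mathbf{x} = \mathbf{e}_1$ gives
\[
\mathbf{y} = \mathbf{P}'\mathbf{e}_1 = [1, 0, \ldots, 0]'
\]
since
\[
\mathbf{e}_k'\mathbf{e}_1 = \begin{cases} 1, & k = 1 \\ 0, & k \ne 1 \end{cases}
\]
For this choice of $\mathbf{x}$, we have $\mathbf{y}'\boldsymbol{\Lambda}\mathbf{y}/\mathbf{y}'\mathbf{y} = \lambda_1/1 = \lambda_1$, or
\[
\frac{\mathbf{e}_1'\mathbf{B}\mathbf{e}_1}{\mathbf{e}_1'\mathbf{e}_1} = \mathbf{e}_1'\mathbf{B}\mathbf{e}_1 = \lambda_1 \tag{2-54}
\]
A similar argument produces the second part of (2-51).

Now, $\mathbf{x} = \mathbf{P}\mathbf{y} = y_1\mathbf{e}_1 + y_2\mathbf{e}_2 + \cdots + y_p\mathbf{e}_p$, so $\mathbf{x} \perp \mathbf{e}_1, \ldots, \mathbf{e}_k$ implies
\[
0 = \mathbf{e}_i'\mathbf{x} = y_1\mathbf{e}_i'\mathbf{e}_1 + y_2\mathbf{e}_i'\mathbf{e}_2 + \cdots + y_p\mathbf{e}_i'\mathbf{e}_p = y_i, \quad i \le k
\]
Therefore, for $\mathbf{x}$ perpendicular to the first $k$ eigenvectors $\mathbf{e}_i$, the left-hand side of the
inequality in (2-53) becomes
\[
\frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}} = \frac{\sum_{i=k+1}^{p} \lambda_i y_i^2}{\sum_{i=k+1}^{p} y_i^2}
\]
Taking $y_{k+1} = 1$, $y_{k+2} = \cdots = y_p = 0$ gives the asserted maximum. ■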


For a fixed $\mathbf{x}_0 \ne \mathbf{0}$, $\mathbf{x}_0'\mathbf{B}\mathbf{x}_0/\mathbf{x}_0'\mathbf{x}_0$ has the same value as $\mathbf{x}'\mathbf{B}\mathbf{x}$, where
$\mathbf{x}' = \mathbf{x}_0'/\sqrt{\mathbf{x}_0'\mathbf{x}_0}$ is of unit length. Consequently, Equation (2-51) says that the
largest eigenvalue, $\lambda_1$, is the maximum value of the quadratic form $\mathbf{x}'\mathbf{B}\mathbf{x}$ for all
points $\mathbf{x}$ whose distance from the origin is unity. Similarly, $\lambda_p$ is the smallest value of
the quadratic form for all points $\mathbf{x}$ one unit from the origin. The largest and smallest
eigenvalues thus represent extreme values of $\mathbf{x}'\mathbf{B}\mathbf{x}$ for points on the unit sphere.
The "intermediate" eigenvalues of the $p \times p$ positive definite matrix $\mathbf{B}$ also have an
interpretation as extreme values when $\mathbf{x}$ is further restricted to be perpendicular to
the earlier choices.
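The interpretation of the extreme eigenvalues can be seen numerically by evaluating the quadratic form at points on the unit circle. The sketch below assumes a 2 x 2 symmetric matrix with eigenvalues 3 and 2 and eigenvectors proportional to (1, 2) and (2, -1); the one-degree sampling grid is an arbitrary illustrative choice.

```python
import math

B = [[2.2, 0.4], [0.4, 2.8]]   # assumed example; eigenvalues 3 and 2

def quad_form(x):
    """Evaluate x'Bx for a 2-vector x."""
    return sum(x[i] * B[i][j] * x[j] for i in range(2) for j in range(2))

# Points on the unit circle, one per degree.
values = [quad_form((math.cos(math.radians(t)), math.sin(math.radians(t))))
          for t in range(360)]

e1 = (1 / math.sqrt(5), 2 / math.sqrt(5))    # unit eigenvector for eigenvalue 3
e2 = (2 / math.sqrt(5), -1 / math.sqrt(5))   # unit eigenvector for eigenvalue 2
print(min(values), max(values), quad_form(e1), quad_form(e2))
```

Every sampled value lies between the two eigenvalues, and the extremes are reached at the eigenvectors themselves.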


VECTORS AND MATRICES: BASIC CONCEPTS


Vectors 


Many concepts, such as a person's health, intellectual abilities, or personality, cannot
be adequately quantified as a single number. Rather, several different measurements
$x_1, x_2, \ldots, x_m$ are required.


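Definition 2A.1. An m-tuple of real numbers $(x_1, x_2, \ldots, x_i, \ldots, x_m)$ arranged in a
column is called a vector and is denoted by a boldfaced, lowercase letter.

Examples of vectors are
\[
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}, \quad
\mathbf{a} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad
\mathbf{b} = \begin{bmatrix} 1 \\ -1 \\ 1 \\ -1 \end{bmatrix}, \quad
\mathbf{y} = \begin{bmatrix} 1 \\ 2 \\ -2 \end{bmatrix}
\]

Vectors are said to be equal if their corresponding entries are the same.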


Definition 2A.2 (Scalar multiplication). Let c be an arbitrary scalar. Then the 
product cx is a vector with ith entry cx;. 


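To illustrate scalar multiplication, take $c_1 = 5$ and $c_2 = -1.2$. Then
\[
c_1\mathbf{y} = 5\begin{bmatrix} 1 \\ 2 \\ -2 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \\ -10 \end{bmatrix}
\quad \text{and} \quad
c_2\mathbf{y} = (-1.2)\begin{bmatrix} 1 \\ 2 \\ -2 \end{bmatrix} = \begin{bmatrix} -1.2 \\ -2.4 \\ 2.4 \end{bmatrix}
\]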




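Definition 2A.3 (Vector addition). The sum of two vectors $\mathbf{x}$ and $\mathbf{y}$, each having the
same number of entries, is that vector
\[
\mathbf{z} = \mathbf{x} + \mathbf{y} \quad \text{with } i\text{th entry } z_i = x_i + y_i
\]

Thus,
\[
\underset{\mathbf{x}}{\begin{bmatrix} 3 \\ -1 \\ 4 \end{bmatrix}} + \underset{\mathbf{y}}{\begin{bmatrix} 1 \\ 2 \\ -2 \end{bmatrix}} = \underset{\mathbf{z}}{\begin{bmatrix} 4 \\ 1 \\ 2 \end{bmatrix}}
\]

Taking the zero vector, $\mathbf{0}$, to be the m-tuple $(0, 0, \ldots, 0)$ and the vector $-\mathbf{x}$ to be the
m-tuple $(-x_1, -x_2, \ldots, -x_m)$, the two operations of scalar multiplication and
vector addition can be combined in a useful manner.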


Definition 2A.4. The space of all real m-tuples, with scalar multiplication and 
vector addition as just defined, is called a vector space. 


Definition 2A.5. The vector $\mathbf{y} = a_1\mathbf{x}_1 + a_2\mathbf{x}_2 + \cdots + a_k\mathbf{x}_k$ is a linear combination of
the vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$. The set of all linear combinations of $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ is called
their linear span.

Definition 2A.6. A set of vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ is said to be linearly dependent if
there exist $k$ numbers $(a_1, a_2, \ldots, a_k)$, not all zero, such that
\[
a_1\mathbf{x}_1 + a_2\mathbf{x}_2 + \cdots + a_k\mathbf{x}_k = \mathbf{0}
\]
Otherwise the set of vectors is said to be linearly independent.

If one of the vectors, for example, $\mathbf{x}_i$, is $\mathbf{0}$, the set is linearly dependent. (Let $a_i$ be
the only nonzero coefficient in Definition 2A.6.)


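The familiar vectors with a one as an entry and zeros elsewhere are linearly
independent. For $m = 4$,
\[
\mathbf{x}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
\mathbf{x}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad
\mathbf{x}_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
\mathbf{x}_4 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\]
so
\[
\mathbf{0} = a_1\mathbf{x}_1 + a_2\mathbf{x}_2 + a_3\mathbf{x}_3 + a_4\mathbf{x}_4
= \begin{bmatrix}
a_1 \cdot 1 + a_2 \cdot 0 + a_3 \cdot 0 + a_4 \cdot 0 \\
a_1 \cdot 0 + a_2 \cdot 1 + a_3 \cdot 0 + a_4 \cdot 0 \\
a_1 \cdot 0 + a_2 \cdot 0 + a_3 \cdot 1 + a_4 \cdot 0 \\
a_1 \cdot 0 + a_2 \cdot 0 + a_3 \cdot 0 + a_4 \cdot 1
\end{bmatrix}
= \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}
\]
implies that $a_1 = a_2 = a_3 = a_4 = 0$.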




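As another example, let $k = 3$ and $m = 3$, and let
\[
\mathbf{x}_1 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad
\mathbf{x}_2 = \begin{bmatrix} 2 \\ 5 \\ -1 \end{bmatrix}, \quad
\mathbf{x}_3 = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}
\]
Then
\[
2\mathbf{x}_1 - \mathbf{x}_2 + 3\mathbf{x}_3 = \mathbf{0}
\]
Thus, $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ are a linearly dependent set of vectors, since any one can be written
as a linear combination of the others (for example, $\mathbf{x}_2 = 2\mathbf{x}_1 + 3\mathbf{x}_3$).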


Definition 2A.7. Any set of m linearly independent vectors is called a basis for the
vector space of all m-tuples of real numbers.


Result 2A.1. Every vector can be expressed as a unique linear combination of a
fixed basis. ■


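With $m = 4$, the usual choice of a basis is
\[
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\]
These four vectors were shown to be linearly independent. Any vector $\mathbf{x}$ can be
uniquely expressed as
\[
x_1\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}
+ x_2\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}
+ x_3\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}
+ x_4\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
= \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \mathbf{x}
\]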


A vector consisting of m elements may be regarded geometrically as a point in 
m-dimensional space. For example, with m = 2, the vector x may be regarded as 
representing the point in the plane with coordinates x, and x2. 

Vectors have the geometrical properties of length and direction. 


Definition 2A.8. The length of a vector of m elements emanating from the origin is
given by the Pythagorean formula:
\[
\text{length of } \mathbf{x} = L_{\mathbf{x}} = \sqrt{x_1^2 + x_2^2 + \cdots + x_m^2}
\]




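Definition 2A.9. The angle $\theta$ between two vectors $\mathbf{x}$ and $\mathbf{y}$, both having m entries, is
defined from
\[
\cos(\theta) = \frac{x_1y_1 + x_2y_2 + \cdots + x_my_m}{L_{\mathbf{x}}L_{\mathbf{y}}}
\]
where $L_{\mathbf{x}}$ = length of $\mathbf{x}$ and $L_{\mathbf{y}}$ = length of $\mathbf{y}$, $x_1, x_2, \ldots, x_m$ are the elements of $\mathbf{x}$,
and $y_1, y_2, \ldots, y_m$ are the elements of $\mathbf{y}$.

Let
\[
\mathbf{x} = \begin{bmatrix} -1 \\ 5 \\ 2 \\ -2 \end{bmatrix}
\quad \text{and} \quad
\mathbf{y} = \begin{bmatrix} 4 \\ -3 \\ 0 \\ 1 \end{bmatrix}
\]
Then the length of $\mathbf{x}$, the length of $\mathbf{y}$, and the cosine of the angle between the two
vectors are
\[
\begin{aligned}
\text{length of } \mathbf{x} &= \sqrt{(-1)^2 + 5^2 + 2^2 + (-2)^2} = \sqrt{34} = 5.83 \\
\text{length of } \mathbf{y} &= \sqrt{4^2 + (-3)^2 + 0^2 + 1^2} = \sqrt{26} = 5.10
\end{aligned}
\]
and
\[
\begin{aligned}
\cos(\theta) &= \frac{1}{L_{\mathbf{x}}}\frac{1}{L_{\mathbf{y}}}\,[x_1y_1 + x_2y_2 + x_3y_3 + x_4y_4] \\
&= \frac{1}{\sqrt{34}}\frac{1}{\sqrt{26}}\,[(-1)4 + 5(-3) + 2(0) + (-2)1] \\
&= \frac{1}{5.83 \times 5.10}\,[-21] = -.706
\end{aligned}
\]
Consequently, $\theta = 135°$.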
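The length and angle computations in this example can be reproduced with a few lines of code. This is a minimal sketch; the helper name `dot` is an illustrative assumption.

```python
import math

def dot(u, v):
    """Sum of component products of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

x = [-1, 5, 2, -2]
y = [4, -3, 0, 1]

len_x = math.sqrt(dot(x, x))              # Pythagorean length of x
len_y = math.sqrt(dot(y, y))              # Pythagorean length of y
cos_theta = dot(x, y) / (len_x * len_y)   # cosine of the angle between x and y
theta = math.degrees(math.acos(cos_theta))
print(len_x, len_y, cos_theta, theta)
```

Rounded as in the text, these values agree with 5.83, 5.10, -.706, and an angle of about 135 degrees.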
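Definition 2A.10. The inner (or dot) product of two vectors $\mathbf{x}$ and $\mathbf{y}$ with the same
number of entries is defined as the sum of component products:
\[
x_1y_1 + x_2y_2 + \cdots + x_my_m
\]
We use the notation $\mathbf{x}'\mathbf{y}$ or $\mathbf{y}'\mathbf{x}$ to denote this inner product.


With the $\mathbf{x}'\mathbf{y}$ notation, we may express the length of a vector and the cosine of
the angle between two vectors as
\[
L_{\mathbf{x}} = \text{length of } \mathbf{x} = \sqrt{x_1^2 + x_2^2 + \cdots + x_m^2} = \sqrt{\mathbf{x}'\mathbf{x}}
\]
\[
\cos(\theta) = \frac{\mathbf{x}'\mathbf{y}}{\sqrt{\mathbf{x}'\mathbf{x}}\,\sqrt{\mathbf{y}'\mathbf{y}}}
\]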




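Definition 2A.11. When the angle between two vectors $\mathbf{x}$, $\mathbf{y}$ is $\theta = 90°$ or $270°$, we
say that $\mathbf{x}$ and $\mathbf{y}$ are perpendicular. Since $\cos(\theta) = 0$ only if $\theta = 90°$ or $270°$, the
condition becomes

$\mathbf{x}$ and $\mathbf{y}$ are perpendicular if $\mathbf{x}'\mathbf{y} = 0$

We write $\mathbf{x} \perp \mathbf{y}$.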


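The basis vectors
\[
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\]
are mutually perpendicular. Also, each has length unity. The same construction
holds for any number of entries m.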


Result 2A.2.

(a) $\mathbf{z}$ is perpendicular to every vector if and only if $\mathbf{z} = \mathbf{0}$.

(b) If $\mathbf{z}$ is perpendicular to each vector $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$, then $\mathbf{z}$ is perpendicular to
their linear span.

(c) Mutually perpendicular vectors are linearly independent. ■


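Definition 2A.12. The projection (or shadow) of a vector $\mathbf{x}$ on a vector $\mathbf{y}$ is
\[
\text{projection of } \mathbf{x} \text{ on } \mathbf{y} = \frac{(\mathbf{x}'\mathbf{y})}{L_{\mathbf{y}}^2}\,\mathbf{y}
\]
If $\mathbf{y}$ has unit length so that $L_{\mathbf{y}} = 1$,
\[
\text{projection of } \mathbf{x} \text{ on } \mathbf{y} = (\mathbf{x}'\mathbf{y})\,\mathbf{y}
\]
If $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_r$ are mutually perpendicular, the projection (or shadow) of a vector $\mathbf{x}$
on the linear span of $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_r$ is
\[
\frac{(\mathbf{x}'\mathbf{y}_1)}{\mathbf{y}_1'\mathbf{y}_1}\,\mathbf{y}_1 + \frac{(\mathbf{x}'\mathbf{y}_2)}{\mathbf{y}_2'\mathbf{y}_2}\,\mathbf{y}_2 + \cdots + \frac{(\mathbf{x}'\mathbf{y}_r)}{\mathbf{y}_r'\mathbf{y}_r}\,\mathbf{y}_r
\]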


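Result 2A.3 (Gram-Schmidt Process). Given linearly independent vectors $\mathbf{x}_1,
\mathbf{x}_2, \ldots, \mathbf{x}_k$, there exist mutually perpendicular vectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_k$ with the same
linear span. These may be constructed sequentially by setting
\[
\begin{aligned}
\mathbf{u}_1 &= \mathbf{x}_1 \\
\mathbf{u}_2 &= \mathbf{x}_2 - \frac{(\mathbf{x}_2'\mathbf{u}_1)}{\mathbf{u}_1'\mathbf{u}_1}\,\mathbf{u}_1 \\
&\ \ \vdots \\
\mathbf{u}_k &= \mathbf{x}_k - \frac{(\mathbf{x}_k'\mathbf{u}_1)}{\mathbf{u}_1'\mathbf{u}_1}\,\mathbf{u}_1 - \cdots - \frac{(\mathbf{x}_k'\mathbf{u}_{k-1})}{\mathbf{u}_{k-1}'\mathbf{u}_{k-1}}\,\mathbf{u}_{k-1}
\end{aligned}
\]
We can also convert the $\mathbf{u}$'s to unit length by setting $\mathbf{z}_j = \mathbf{u}_j/\sqrt{\mathbf{u}_j'\mathbf{u}_j}$. In this
construction, $(\mathbf{x}_k'\mathbf{z}_j)\,\mathbf{z}_j$ is the projection of $\mathbf{x}_k$ on $\mathbf{z}_j$ and $\sum_{j=1}^{k-1} (\mathbf{x}_k'\mathbf{z}_j)\,\mathbf{z}_j$ is the projection
of $\mathbf{x}_k$ on the linear span of $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{k-1}$. ■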


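For example, to construct perpendicular vectors from
\[
\mathbf{x}_1 = \begin{bmatrix} 4 \\ 0 \\ 0 \\ 2 \end{bmatrix}
\quad \text{and} \quad
\mathbf{x}_2 = \begin{bmatrix} 3 \\ 1 \\ 0 \\ -1 \end{bmatrix}
\]
we take
\[
\mathbf{u}_1 = \mathbf{x}_1 = \begin{bmatrix} 4 \\ 0 \\ 0 \\ 2 \end{bmatrix}
\]
so
\[
\mathbf{u}_1'\mathbf{u}_1 = 4^2 + 0^2 + 0^2 + 2^2 = 20
\]
and
\[
\mathbf{x}_2'\mathbf{u}_1 = 3(4) + 1(0) + 0(0) - 1(2) = 10
\]
Thus,
\[
\mathbf{u}_2 = \begin{bmatrix} 3 \\ 1 \\ 0 \\ -1 \end{bmatrix} - \frac{10}{20}\begin{bmatrix} 4 \\ 0 \\ 0 \\ 2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 0 \\ -2 \end{bmatrix}
\quad \text{and} \quad
\mathbf{z}_1 = \frac{1}{\sqrt{20}}\begin{bmatrix} 4 \\ 0 \\ 0 \\ 2 \end{bmatrix}, \quad
\mathbf{z}_2 = \frac{1}{\sqrt{6}}\begin{bmatrix} 1 \\ 1 \\ 0 \\ -2 \end{bmatrix}
\]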
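The sequential construction in Result 2A.3 translates directly into code. The following is a minimal sketch using exact fractions; the function name `gram_schmidt` and the helper `dot` are illustrative, and the input vectors are assumed to be linearly independent.

```python
from fractions import Fraction

def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Return mutually perpendicular vectors with the same linear span.

    Each new vector is the input vector minus its projections on the
    previously constructed u's. Assumes the inputs are linearly independent.
    """
    us = []
    for x in vectors:
        u = [Fraction(c) for c in x]
        for prev in us:
            coef = Fraction(dot(x, prev)) / dot(prev, prev)
            u = [ui - coef * pi for ui, pi in zip(u, prev)]
        us.append(u)
    return us

u1, u2 = gram_schmidt([[4, 0, 0, 2], [3, 1, 0, -1]])
print(u1, u2, dot(u1, u2))
```

For the two vectors of the example, the second output vector is (1, 1, 0, -2) and its inner product with the first is zero.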
Matrices 


Definition 2A.13. An $m \times k$ matrix, generally denoted by a boldface uppercase
letter such as $\mathbf{A}$, $\mathbf{R}$, $\boldsymbol{\Sigma}$, and so forth, is a rectangular array of elements having m rows
and k columns.


Examples of matrices are 


at ee 100 

A=] 014, B-|' ep 1=/|0 10 
3 4 / 001 
| ee ee 

E=} 72 1], E=fea) 




In our work, the matrix elements will be real numbers or functions taking on values 
in the real numbers. 


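Definition 2A.14. The dimension (abbreviated dim) of an $m \times k$ matrix is the ordered
pair (m, k); m is the row dimension and k is the column dimension. The dimension of a
matrix is frequently indicated in parentheses below the letter representing the matrix.
Thus, the $m \times k$ matrix $\mathbf{A}$ is denoted by $\underset{(m \times k)}{\mathbf{A}}$.


In the preceding examples, the dimension of the matrix $\boldsymbol{\Sigma}$ is $3 \times 3$, and this
information can be conveyed by writing $\underset{(3 \times 3)}{\boldsymbol{\Sigma}}$.


An $m \times k$ matrix, say, $\mathbf{A}$, of arbitrary constants can be written
\[
\underset{(m \times k)}{\mathbf{A}} = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1k} \\
a_{21} & a_{22} & \cdots & a_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mk}
\end{bmatrix}
\]
or more compactly as $\underset{(m \times k)}{\mathbf{A}} = \{a_{ij}\}$, where the index i refers to the row and the
index j refers to the column.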

An m X 1 matrix is referred to as a column vector. A 1 X k matrix is referred 
to as a row vector. Since matrices can be considered as vectors side by side, it is nat- 
ural to define multiplication by a scalar and the addition of two matrices with the 
same dimensions. 


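Definition 2A.15. Two matrices $\underset{(m \times k)}{\mathbf{A}} = \{a_{ij}\}$ and $\underset{(m \times k)}{\mathbf{B}} = \{b_{ij}\}$ are said to be equal,
written $\mathbf{A} = \mathbf{B}$, if $a_{ij} = b_{ij}$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, k$. That is, two matrices are
equal if

(a) Their dimensionality is the same.

(b) Every corresponding element is the same.


Definition 2A.16 (Matrix addition). Let the matrices $\mathbf{A}$ and $\mathbf{B}$ both be of dimension
$m \times k$ with arbitrary elements $a_{ij}$ and $b_{ij}$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, k$, respectively.
The sum of the matrices $\mathbf{A}$ and $\mathbf{B}$ is an $m \times k$ matrix $\mathbf{C}$, written $\mathbf{C} = \mathbf{A} + \mathbf{B}$,
such that the arbitrary element of $\mathbf{C}$ is given by
\[
c_{ij} = a_{ij} + b_{ij} \qquad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, k
\]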


Note that the addition of matrices is defined only for matrices of the same 
dimension. 


For example, 


> 
+ 
ws 
ul 

io) 




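Definition 2A.17 (Scalar multiplication). Let c be an arbitrary scalar and $\underset{(m \times k)}{\mathbf{A}} = \{a_{ij}\}$.
Then $\underset{(m \times k)}{c\mathbf{A}} = \underset{(m \times k)}{\mathbf{A}c} = \underset{(m \times k)}{\mathbf{B}} = \{b_{ij}\}$, where $b_{ij} = ca_{ij} = a_{ij}c$, $i = 1, 2, \ldots, m$,
$j = 1, 2, \ldots, k$.


Multiplication of a matrix by a scalar produces a new matrix whose elements are
the elements of the original matrix, each multiplied by the scalar.


For example, if $c = 2$,
\[
\underset{c\mathbf{A}}{2\begin{bmatrix} 3 & -4 \\ 2 & 6 \\ 0 & 5 \end{bmatrix}}
= \underset{\mathbf{A}c}{\begin{bmatrix} 3 & -4 \\ 2 & 6 \\ 0 & 5 \end{bmatrix} 2}
= \underset{\mathbf{B}}{\begin{bmatrix} 6 & -8 \\ 4 & 12 \\ 0 & 10 \end{bmatrix}}
\]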


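Definition 2A.18 (Matrix subtraction). Let $\underset{(m \times k)}{\mathbf{A}} = \{a_{ij}\}$ and $\underset{(m \times k)}{\mathbf{B}} = \{b_{ij}\}$ be two
matrices of equal dimension. Then the difference between $\mathbf{A}$ and $\mathbf{B}$, written $\mathbf{A} - \mathbf{B}$,
is an $m \times k$ matrix $\mathbf{C} = \{c_{ij}\}$ given by
\[
\mathbf{C} = \mathbf{A} - \mathbf{B} = \mathbf{A} + (-1)\mathbf{B}
\]
That is, $c_{ij} = a_{ij} + (-1)b_{ij} = a_{ij} - b_{ij}$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, k$.

Definition 2A.19. Consider the $m \times k$ matrix $\mathbf{A}$ with arbitrary elements $a_{ij}$, $i = 1,
2, \ldots, m$, $j = 1, 2, \ldots, k$. The transpose of the matrix $\mathbf{A}$, denoted by $\mathbf{A}'$, is
the $k \times m$ matrix with elements $a_{ji}$, $j = 1, 2, \ldots, k$, $i = 1, 2, \ldots, m$. That is, the
transpose of the matrix $\mathbf{A}$ is obtained from $\mathbf{A}$ by interchanging the rows and
columns.


As an example, if
\[
\underset{(2 \times 3)}{\mathbf{A}} = \begin{bmatrix} 2 & 1 & 3 \\ 7 & -4 & 6 \end{bmatrix},
\quad \text{then} \quad
\underset{(3 \times 2)}{\mathbf{A}'} = \begin{bmatrix} 2 & 7 \\ 1 & -4 \\ 3 & 6 \end{bmatrix}
\]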


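Result 2A.4. For all matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ (of equal dimension) and scalars c and d,
the following hold:

(a) $(\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C})$

(b) $\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$

(c) $c(\mathbf{A} + \mathbf{B}) = c\mathbf{A} + c\mathbf{B}$

(d) $(c + d)\mathbf{A} = c\mathbf{A} + d\mathbf{A}$

(e) $(\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}'$ (That is, the transpose of the sum is equal to the
sum of the transposes.)

(f) $(cd)\mathbf{A} = c(d\mathbf{A})$

(g) $(c\mathbf{A})' = c\mathbf{A}'$ ■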




Definition 2A.20. If an arbitrary matrix $\mathbf{A}$ has the same number of rows and columns,
then $\mathbf{A}$ is called a square matrix. The matrices $\boldsymbol{\Sigma}$, $\mathbf{I}$, and $\mathbf{E}$ given after Definition 2A.13
are square matrices.


Definition 2A.21. Let $\mathbf{A}$ be a $k \times k$ (square) matrix. Then $\mathbf{A}$ is said to be symmetric
if $\mathbf{A} = \mathbf{A}'$. That is, $\mathbf{A}$ is symmetric if $a_{ij} = a_{ji}$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, k$.


Examples of symmetric matrices are 


gk A RF A 
Rk AS 


DB 

x 

coe 

I 
oo" 
ore 
ere OO 
rs) 

x 

2 

| 

za N 
ee 
LJ 
cs 

x 

_ 

I 
“A mh AR 
Qo SF 4 


Definition 2A.22. The $k \times k$ identity matrix, denoted by $\underset{(k \times k)}{\mathbf{I}}$, is the square matrix
with ones on the main (NW-SE) diagonal and zeros elsewhere. The $3 \times 3$ identity
matrix is shown before this definition.


Definition 2A.23 (Matrix multiplication). The product $\mathbf{A}\mathbf{B}$ of an $m \times n$ matrix
$\mathbf{A} = \{a_{ij}\}$ and an $n \times k$ matrix $\mathbf{B} = \{b_{ij}\}$ is the $m \times k$ matrix $\mathbf{C}$ whose elements
are
\[
c_{ij} = \sum_{\ell=1}^{n} a_{i\ell}\,b_{\ell j} \qquad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, k
\]

Note that for the product $\mathbf{A}\mathbf{B}$ to be defined, the column dimension of $\mathbf{A}$ must
equal the row dimension of $\mathbf{B}$. If that is so, then the row dimension of $\mathbf{A}\mathbf{B}$ equals
the row dimension of $\mathbf{A}$, and the column dimension of $\mathbf{A}\mathbf{B}$ equals the column
dimension of $\mathbf{B}$.


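For example, let
\[
\underset{(2 \times 3)}{\mathbf{A}} = \begin{bmatrix} 3 & -1 & 2 \\ 4 & 0 & 5 \end{bmatrix}
\quad \text{and} \quad
\underset{(3 \times 2)}{\mathbf{B}} = \begin{bmatrix} 3 & 4 \\ 6 & -2 \\ 4 & 3 \end{bmatrix}
\]
Then
\[
\underset{(2 \times 3)}{\mathbf{A}}\,\underset{(3 \times 2)}{\mathbf{B}}
= \begin{bmatrix} 11 & 20 \\ 32 & 31 \end{bmatrix}
= \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}
\]
where
\[
\begin{aligned}
c_{11} &= (3)(3) + (-1)(6) + (2)(4) = 11 \\
c_{12} &= (3)(4) + (-1)(-2) + (2)(3) = 20 \\
c_{21} &= (4)(3) + (0)(6) + (5)(4) = 32 \\
c_{22} &= (4)(4) + (0)(-2) + (5)(3) = 31
\end{aligned}
\]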
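The element-by-element rule of Definition 2A.23 can be written as a short function. In this sketch the matrices `A` and `B` are the ones implied by the four element computations above, and `matmul` is an illustrative name; the dimension check mirrors the requirement that the column dimension of the first factor equal the row dimension of the second.

```python
def matmul(a, b):
    """Product of an m x n matrix a and an n x k matrix b (lists of rows)."""
    n = len(b)
    if any(len(row) != n for row in a):
        raise ValueError("column dimension of a must equal row dimension of b")
    return [[sum(a[i][l] * b[l][j] for l in range(n))
             for j in range(len(b[0]))] for i in range(len(a))]

A = [[3, -1, 2], [4, 0, 5]]       # 2 x 3
B = [[3, 4], [6, -2], [4, 3]]     # 3 x 2
C = matmul(A, B)                  # 2 x 2
print(C)
```

Calling `matmul(B, A)` is also valid here and returns a 3 x 3 matrix, whereas `matmul(A, A)` raises an error because the dimensions do not conform.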


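As an additional example, consider the product of two vectors. Let
\[
\mathbf{x} = \begin{bmatrix} 1 \\ 0 \\ -2 \\ 3 \end{bmatrix}
\quad \text{and} \quad
\mathbf{y} = \begin{bmatrix} 2 \\ -3 \\ -1 \\ -8 \end{bmatrix}
\]
Then $\mathbf{x}' = [1 \ \ 0 \ \ {-2} \ \ 3]$ and
\[
\mathbf{x}'\mathbf{y} = [1 \ \ 0 \ \ {-2} \ \ 3]\begin{bmatrix} 2 \\ -3 \\ -1 \\ -8 \end{bmatrix} = [-20]
= [2 \ \ {-3} \ \ {-1} \ \ {-8}]\begin{bmatrix} 1 \\ 0 \\ -2 \\ 3 \end{bmatrix} = \mathbf{y}'\mathbf{x}
\]

Note that the product $\mathbf{x}\mathbf{y}$ is undefined, since $\mathbf{x}$ is a $4 \times 1$ matrix and $\mathbf{y}$ is a $4 \times 1$ matrix,
so the column dim of $\mathbf{x}$, 1, is unequal to the row dim of $\mathbf{y}$, 4. If $\mathbf{x}$ and $\mathbf{y}$ are vectors
of the same dimension, such as $n \times 1$, both of the products $\mathbf{x}'\mathbf{y}$ and $\mathbf{x}\mathbf{y}'$ are defined.
In particular, $\mathbf{y}'\mathbf{x} = \mathbf{x}'\mathbf{y} = x_1y_1 + x_2y_2 + \cdots + x_ny_n$, and $\mathbf{x}\mathbf{y}'$ is an $n \times n$ matrix
with i, jth element $x_iy_j$.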


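Result 2A.5. For all matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ (of dimensions such that the indicated
products are defined) and a scalar c,

(a) $c(\mathbf{A}\mathbf{B}) = (c\mathbf{A})\mathbf{B}$

(b) $\mathbf{A}(\mathbf{B}\mathbf{C}) = (\mathbf{A}\mathbf{B})\mathbf{C}$

(c) $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}$

(d) $(\mathbf{B} + \mathbf{C})\mathbf{A} = \mathbf{B}\mathbf{A} + \mathbf{C}\mathbf{A}$

(e) $(\mathbf{A}\mathbf{B})' = \mathbf{B}'\mathbf{A}'$

More generally, for any $\mathbf{x}_j$ such that $\mathbf{A}\mathbf{x}_j$ is defined,

(f) $\sum_{j=1}^{n} \mathbf{A}\mathbf{x}_j = \mathbf{A}\sum_{j=1}^{n} \mathbf{x}_j$

(g) $\sum_{j=1}^{n} (\mathbf{A}\mathbf{x}_j)(\mathbf{A}\mathbf{x}_j)' = \mathbf{A}\left(\sum_{j=1}^{n} \mathbf{x}_j\mathbf{x}_j'\right)\mathbf{A}'$ ■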




There are several important differences between the algebra of matrices and 
the algebra of real numbers. Two of these differences are as follows: 


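1. Matrix multiplication is, in general, not commutative. That is, in general,
$\mathbf{A}\mathbf{B} \ne \mathbf{B}\mathbf{A}$. Several examples will illustrate the failure of the commutative law
(for matrices).
\[
\begin{bmatrix} 3 & -1 \\ 4 & 7 \end{bmatrix}\begin{bmatrix} 0 \\ 2 \end{bmatrix} = \begin{bmatrix} -2 \\ 14 \end{bmatrix}
\]
but
\[
\begin{bmatrix} 0 \\ 2 \end{bmatrix}\begin{bmatrix} 3 & -1 \\ 4 & 7 \end{bmatrix}
\]
is not defined.
\[
\begin{bmatrix} 1 & 0 & 1 \\ 2 & -3 & 6 \end{bmatrix}\begin{bmatrix} 7 & 6 \\ -3 & 1 \\ 2 & 4 \end{bmatrix} = \begin{bmatrix} 9 & 10 \\ 35 & 33 \end{bmatrix}
\]
but
\[
\begin{bmatrix} 7 & 6 \\ -3 & 1 \\ 2 & 4 \end{bmatrix}\begin{bmatrix} 1 & 0 & 1 \\ 2 & -3 & 6 \end{bmatrix} = \begin{bmatrix} 19 & -18 & 43 \\ -1 & -3 & 3 \\ 10 & -12 & 26 \end{bmatrix}
\]
Also,
\[
\begin{bmatrix} 4 & -1 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 2 & 1 \\ -3 & 4 \end{bmatrix} = \begin{bmatrix} 11 & 0 \\ -3 & 4 \end{bmatrix}
\]
but
\[
\begin{bmatrix} 2 & 1 \\ -3 & 4 \end{bmatrix}\begin{bmatrix} 4 & -1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 8 & -1 \\ -12 & 7 \end{bmatrix}
\]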
2. Let $\mathbf{0}$ denote the zero matrix, that is, the matrix with zero for every element. In
the algebra of real numbers, if the product of two numbers, ab, is zero, then
$a = 0$ or $b = 0$. In matrix algebra, however, the product of two nonzero matrices
may be the zero matrix. Hence,
\[
\underset{(m \times n)}{\mathbf{A}}\,\underset{(n \times k)}{\mathbf{B}} = \underset{(m \times k)}{\mathbf{0}}
\]
does not imply that $\mathbf{A} = \mathbf{0}$ or $\mathbf{B} = \mathbf{0}$. For example,


bey 3-0] 


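It is true, however, that if either $\underset{(m \times n)}{\mathbf{A}} = \underset{(m \times n)}{\mathbf{0}}$ or $\underset{(n \times k)}{\mathbf{B}} = \underset{(n \times k)}{\mathbf{0}}$, then
$\underset{(m \times n)}{\mathbf{A}}\,\underset{(n \times k)}{\mathbf{B}} = \underset{(m \times k)}{\mathbf{0}}$.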




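Definition 2A.24. The determinant of the square $k \times k$ matrix $\mathbf{A} = \{a_{ij}\}$, denoted
by $|\mathbf{A}|$, is the scalar
\[
|\mathbf{A}| = \begin{cases}
a_{11} & \text{if } k = 1 \\[4pt]
\displaystyle\sum_{j=1}^{k} a_{1j}\,|\mathbf{A}_{1j}|\,(-1)^{1+j} & \text{if } k > 1
\end{cases}
\]
where $\mathbf{A}_{1j}$ is the $(k - 1) \times (k - 1)$ matrix obtained by deleting the first row and
jth column of $\mathbf{A}$. Also, $|\mathbf{A}| = \sum_{j=1}^{k} a_{ij}\,|\mathbf{A}_{ij}|\,(-1)^{i+j}$, with the ith row in place of the first
row.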


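Examples of determinants (evaluated using Definition 2A.24) are
\[
\begin{vmatrix} 1 & 3 \\ 6 & 4 \end{vmatrix} = 1|4|(-1)^2 + 3|6|(-1)^3 = 1(4) + 3(6)(-1) = -14
\]
In general,
\[
\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22}(-1)^2 + a_{12}a_{21}(-1)^3 = a_{11}a_{22} - a_{12}a_{21}
\]
\[
\begin{vmatrix} 3 & 1 & 6 \\ 7 & 4 & 5 \\ 2 & -7 & 1 \end{vmatrix}
= 3\begin{vmatrix} 4 & 5 \\ -7 & 1 \end{vmatrix}(-1)^2 + 1\begin{vmatrix} 7 & 5 \\ 2 & 1 \end{vmatrix}(-1)^3 + 6\begin{vmatrix} 7 & 4 \\ 2 & -7 \end{vmatrix}(-1)^4
= 3(39) - 1(-3) + 6(-57) = -222
\]
\[
\begin{vmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{vmatrix}
= 1\begin{vmatrix} 1 & 0 \\ 0 & 1 \end{vmatrix}(-1)^2 + 0\begin{vmatrix} 0 & 0 \\ 0 & 1 \end{vmatrix}(-1)^3 + 0\begin{vmatrix} 0 & 1 \\ 0 & 0 \end{vmatrix}(-1)^4 = 1(1) = 1
\]
If $\mathbf{I}$ is the $k \times k$ identity matrix, $|\mathbf{I}| = 1$.
\[
\begin{aligned}
\begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix}
&= a_{11}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix}(-1)^2 + a_{12}\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix}(-1)^3 + a_{13}\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix}(-1)^4 \\
&= a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{21}a_{32}a_{13} - a_{31}a_{22}a_{13} - a_{21}a_{12}a_{33} - a_{32}a_{23}a_{11}
\end{aligned}
\]

The determinant of any $3 \times 3$ matrix can be computed by summing the products
of elements along the solid lines and subtracting the products along the dashed
lines in the following diagram. This procedure is not valid for matrices of higher
dimension, but in general, Definition 2A.24 can be employed to evaluate these
determinants.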
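The recursive definition of the determinant can be coded directly as a cofactor expansion along the first row. This is a minimal sketch for small matrices stored as lists of rows; the function name `det` is illustrative, and the approach is not meant as an efficient method for large dimensions.

```python
def det(a):
    """Determinant by cofactor expansion along the first row."""
    k = len(a)
    if k == 1:
        return a[0][0]
    total = 0
    for j in range(k):
        # Delete the first row and the jth column to form the minor matrix.
        minor = [row[:j] + row[j + 1:] for row in a[1:]]
        # With 0-based j the sign (-1)^(1 + (j + 1)) equals (-1)^j.
        total += a[0][j] * det(minor) * (-1) ** j
    return total

d2 = det([[1, 3], [6, 4]])
d3 = det([[3, 1, 6], [7, 4, 5], [2, -7, 1]])
d_identity = det([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
print(d2, d3, d_identity)
```

The three values agree with the worked examples: -14, -222, and 1.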


We next want to state a result that describes some properties of the determinant. 
However, we must first introduce some notions related to matrix inverses, 


Definition 2A.25. The row rank of a matrix is the maximum number of linearly independent
rows, considered as vectors (that is, row vectors). The column rank of a matrix
is the rank of its set of columns, considered as vectors.


For example, let the matrix
\[
\mathbf{A} = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 5 & -1 \\ 0 & 1 & -1 \end{bmatrix}
\]
The rows of $\mathbf{A}$, written as vectors, were shown to be linearly dependent after
Definition 2A.6. Note that the column rank of $\mathbf{A}$ is also 2, since
\[
-2\begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix} + \begin{bmatrix} 1 \\ 5 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
\]
but columns 1 and 2 are linearly independent. This is no coincidence, as the
following result indicates.


Result 2A.6. The row rank and the column rank of a matrix are equal. ■


Thus, the rank of a matrix is either the row rank or the column rank. 




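Definition 2A.26. A square matrix $\underset{(k \times k)}{\mathbf{A}}$ is nonsingular if $\underset{(k \times k)}{\mathbf{A}}\,\underset{(k \times 1)}{\mathbf{x}} = \underset{(k \times 1)}{\mathbf{0}}$ implies
that $\underset{(k \times 1)}{\mathbf{x}} = \underset{(k \times 1)}{\mathbf{0}}$. If a matrix fails to be nonsingular, it is called singular. Equivalently,
a square matrix is nonsingular if its rank is equal to the number of rows (or columns)
it has.


Note that $\mathbf{A}\mathbf{x} = x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_k\mathbf{a}_k$, where $\mathbf{a}_i$ is the ith column of $\mathbf{A}$, so
that the condition of nonsingularity is just the statement that the columns of $\mathbf{A}$ are
linearly independent.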


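Result 2A.7. Let $\mathbf{A}$ be a nonsingular square matrix of dimension $k \times k$. Then there
is a unique $k \times k$ matrix $\mathbf{B}$ such that
\[
\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A} = \mathbf{I}
\]
where $\mathbf{I}$ is the $k \times k$ identity matrix. ■


Definition 2A.27. The $\mathbf{B}$ such that $\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A} = \mathbf{I}$ is called the inverse of $\mathbf{A}$ and is
denoted by $\mathbf{A}^{-1}$. In fact, if $\mathbf{B}\mathbf{A} = \mathbf{I}$ or $\mathbf{A}\mathbf{B} = \mathbf{I}$, then $\mathbf{B} = \mathbf{A}^{-1}$, and both products
must equal $\mathbf{I}$.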


For example, 


since 


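Result 2A.8.

(a) The inverse of any $2 \times 2$ matrix
\[
\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\]
is given by
\[
\mathbf{A}^{-1} = \frac{1}{|\mathbf{A}|}\begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}
\]

(b) The inverse of any $3 \times 3$ matrix
\[
\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\]
is given by
\[
\mathbf{A}^{-1} = \frac{1}{|\mathbf{A}|}
\begin{bmatrix}
\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} &
-\begin{vmatrix} a_{12} & a_{13} \\ a_{32} & a_{33} \end{vmatrix} &
\begin{vmatrix} a_{12} & a_{13} \\ a_{22} & a_{23} \end{vmatrix} \\[8pt]
-\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} &
\begin{vmatrix} a_{11} & a_{13} \\ a_{31} & a_{33} \end{vmatrix} &
-\begin{vmatrix} a_{11} & a_{13} \\ a_{21} & a_{23} \end{vmatrix} \\[8pt]
\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix} &
-\begin{vmatrix} a_{11} & a_{12} \\ a_{31} & a_{32} \end{vmatrix} &
\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix}
\end{bmatrix}
\]

In both (a) and (b), it is clear that $|\mathbf{A}| \ne 0$ if the inverse is to exist.

(c) In general, $\mathbf{A}^{-1}$ has j, ith entry $[\,|\mathbf{A}_{ij}|/|\mathbf{A}|\,](-1)^{i+j}$, where $\mathbf{A}_{ij}$ is the matrix
obtained from $\mathbf{A}$ by deleting the ith row and jth column. ■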
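The 2 x 2 case of this result is easy to check with exact fractions. The sketch below applies the standard 2 x 2 inverse formula to a hypothetical matrix (the one whose determinant was evaluated earlier as -14) and confirms that the product with the original matrix is the identity; `inverse_2x2` is an illustrative name.

```python
from fractions import Fraction

def inverse_2x2(a):
    """Inverse of a 2 x 2 matrix: swap the diagonal, negate the off-diagonal,
    and divide every entry by the determinant."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return [[Fraction(a[1][1], det), Fraction(-a[0][1], det)],
            [Fraction(-a[1][0], det), Fraction(a[0][0], det)]]

A = [[1, 3], [6, 4]]          # hypothetical matrix with determinant -14
A_inv = inverse_2x2(A)
product = [[sum(A[i][l] * A_inv[l][j] for l in range(2)) for j in range(2)]
           for i in range(2)]
print(A_inv, product)
```

A zero determinant triggers the error branch, matching the remark that the determinant must be nonzero for the inverse to exist.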


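Result 2A.9. For a square matrix $\mathbf{A}$ of dimension $k \times k$, the following are equivalent:

(a) $\underset{(k \times k)}{\mathbf{A}}\,\underset{(k \times 1)}{\mathbf{x}} = \underset{(k \times 1)}{\mathbf{0}}$ implies $\underset{(k \times 1)}{\mathbf{x}} = \underset{(k \times 1)}{\mathbf{0}}$ ($\mathbf{A}$ is nonsingular).

(b) $|\mathbf{A}| \ne 0$.

(c) There exists a matrix $\mathbf{A}^{-1}$ such that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \underset{(k \times k)}{\mathbf{I}}$. ■


Result 2A.10. Let $\mathbf{A}$ and $\mathbf{B}$ be square matrices of the same dimension, and let the
indicated inverses exist. Then the following hold:

(a) $(\mathbf{A}^{-1})' = (\mathbf{A}')^{-1}$

(b) $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$ ■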


The determinant has the following properties. 


Result 2A.11. Let $\mathbf{A}$ and $\mathbf{B}$ be $k \times k$ square matrices.

(a) $|\mathbf{A}| = |\mathbf{A}'|$

(b) If each element of a row (column) of $\mathbf{A}$ is zero, then $|\mathbf{A}| = 0$

(c) If any two rows (columns) of $\mathbf{A}$ are identical, then $|\mathbf{A}| = 0$

(d) If $\mathbf{A}$ is nonsingular, then $|\mathbf{A}| = 1/|\mathbf{A}^{-1}|$; that is, $|\mathbf{A}|\,|\mathbf{A}^{-1}| = 1$.

(e) $|\mathbf{A}\mathbf{B}| = |\mathbf{A}|\,|\mathbf{B}|$

(f) $|c\mathbf{A}| = c^k|\mathbf{A}|$, where c is a scalar.


You are referred to [6] for proofs of parts of Results 2A.9 and 2A.11. Some of
these proofs are rather complex and beyond the scope of this book. ■


Definition 2A.28. Let $\mathbf{A} = \{a_{ij}\}$ be a $k \times k$ square matrix. The trace of the matrix $\mathbf{A}$,
written $\operatorname{tr}(\mathbf{A})$, is the sum of the diagonal elements; that is, $\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^{k} a_{ii}$.




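Result 2A.12. Let $\mathbf{A}$ and $\mathbf{B}$ be $k \times k$ matrices and c be a scalar.

(a) $\operatorname{tr}(c\mathbf{A}) = c\operatorname{tr}(\mathbf{A})$

(b) $\operatorname{tr}(\mathbf{A} + \mathbf{B}) = \operatorname{tr}(\mathbf{A}) + \operatorname{tr}(\mathbf{B})$

(c) $\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{A})$

(d) $\operatorname{tr}(\mathbf{B}^{-1}\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{A})$

(e) $\operatorname{tr}(\mathbf{A}\mathbf{A}') = \sum_{i=1}^{k}\sum_{j=1}^{k} a_{ij}^2$ ■


Definition 2A.29. A square matrix $\mathbf{A}$ is said to be orthogonal if its rows, considered
as vectors, are mutually perpendicular and have unit lengths; that is, $\mathbf{A}\mathbf{A}' = \mathbf{I}$.


Result 2A.13. A matrix $\mathbf{A}$ is orthogonal if and only if $\mathbf{A}^{-1} = \mathbf{A}'$. For an orthogonal
matrix, $\mathbf{A}\mathbf{A}' = \mathbf{A}'\mathbf{A} = \mathbf{I}$, so the columns are also mutually perpendicular and have
unit lengths. ■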


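An example of an orthogonal matrix is
\[
\mathbf{A} = \begin{bmatrix}
-\tfrac12 & \tfrac12 & \tfrac12 & \tfrac12 \\
\tfrac12 & -\tfrac12 & \tfrac12 & \tfrac12 \\
\tfrac12 & \tfrac12 & -\tfrac12 & \tfrac12 \\
\tfrac12 & \tfrac12 & \tfrac12 & -\tfrac12
\end{bmatrix}
\]
Clearly, $\mathbf{A} = \mathbf{A}'$, so $\mathbf{A}\mathbf{A}' = \mathbf{A}'\mathbf{A} = \mathbf{A}\mathbf{A}$. We verify that $\mathbf{A}\mathbf{A} = \mathbf{I} = \mathbf{A}\mathbf{A}' = \mathbf{A}'\mathbf{A}$, or
\[
\underset{\mathbf{A}}{\begin{bmatrix}
-\tfrac12 & \tfrac12 & \tfrac12 & \tfrac12 \\
\tfrac12 & -\tfrac12 & \tfrac12 & \tfrac12 \\
\tfrac12 & \tfrac12 & -\tfrac12 & \tfrac12 \\
\tfrac12 & \tfrac12 & \tfrac12 & -\tfrac12
\end{bmatrix}}
\underset{\mathbf{A}}{\begin{bmatrix}
-\tfrac12 & \tfrac12 & \tfrac12 & \tfrac12 \\
\tfrac12 & -\tfrac12 & \tfrac12 & \tfrac12 \\
\tfrac12 & \tfrac12 & -\tfrac12 & \tfrac12 \\
\tfrac12 & \tfrac12 & \tfrac12 & -\tfrac12
\end{bmatrix}}
= \underset{\mathbf{I}}{\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}}
\]
so $\mathbf{A}' = \mathbf{A}^{-1}$, and $\mathbf{A}$ must be an orthogonal matrix.

Square matrices are best understood in terms of quantities called eigenvalues
and eigenvectors.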


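Definition 2A.30. Let $\mathbf{A}$ be a $k \times k$ square matrix and $\mathbf{I}$ be the $k \times k$ identity matrix.
Then the scalars $\lambda_1, \lambda_2, \ldots, \lambda_k$ satisfying the polynomial equation $|\mathbf{A} - \lambda\mathbf{I}| = 0$
are called the eigenvalues (or characteristic roots) of a matrix $\mathbf{A}$. The equation
$|\mathbf{A} - \lambda\mathbf{I}| = 0$ (as a function of $\lambda$) is called the characteristic equation.


For example, let
\[
\mathbf{A} = \begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}
\]
Then
\[
|\mathbf{A} - \lambda\mathbf{I}| = \left|\begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix} - \lambda\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right|
= \begin{vmatrix} 1 - \lambda & 0 \\ 1 & 3 - \lambda \end{vmatrix} = (1 - \lambda)(3 - \lambda) = 0
\]
implies that there are two roots, $\lambda_1 = 1$ and $\lambda_2 = 3$. The eigenvalues of $\mathbf{A}$ are 3
and 1. Let
\[
\mathbf{A} = \begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix}
\]
Then the equation
\[
|\mathbf{A} - \lambda\mathbf{I}| = \begin{vmatrix} 13 - \lambda & -4 & 2 \\ -4 & 13 - \lambda & -2 \\ 2 & -2 & 10 - \lambda \end{vmatrix} = -\lambda^3 + 36\lambda^2 - 405\lambda + 1458 = 0
\]
has three roots: $\lambda_1 = 9$, $\lambda_2 = 9$, and $\lambda_3 = 18$; that is, 9, 9, and 18 are the eigenvalues
of $\mathbf{A}$.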

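Definition 2A.31. Let $\mathbf{A}$ be a square matrix of dimension $k \times k$ and let $\lambda$ be an eigenvalue
of $\mathbf{A}$. If $\underset{(k \times 1)}{\mathbf{x}}$ is a nonzero vector ($\underset{(k \times 1)}{\mathbf{x}} \ne \underset{(k \times 1)}{\mathbf{0}}$) such that
\[
\mathbf{A}\mathbf{x} = \lambda\mathbf{x}
\]
then $\mathbf{x}$ is said to be an eigenvector (characteristic vector) of the matrix $\mathbf{A}$ associated with
the eigenvalue $\lambda$.


An equivalent condition for $\lambda$ to be a solution of the eigenvalue-eigenvector
equation is $|\mathbf{A} - \lambda\mathbf{I}| = 0$. This follows because the statement that $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$ for
some $\lambda$ and $\mathbf{x} \ne \mathbf{0}$ implies that
\[
\mathbf{0} = (\mathbf{A} - \lambda\mathbf{I})\mathbf{x} = x_1\operatorname{col}_1(\mathbf{A} - \lambda\mathbf{I}) + \cdots + x_k\operatorname{col}_k(\mathbf{A} - \lambda\mathbf{I})
\]
That is, the columns of $\mathbf{A} - \lambda\mathbf{I}$ are linearly dependent so, by Result 2A.9(b),
$|\mathbf{A} - \lambda\mathbf{I}| = 0$, as asserted. Following Definition 2A.30, we have shown that the
eigenvalues of
\[
\mathbf{A} = \begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}
\]
are $\lambda_1 = 1$ and $\lambda_2 = 3$. The eigenvectors associated with these eigenvalues can be
determined by solving the following equations:
\[
\begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 1\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\qquad
\begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 3\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\]
From the first expression,
\[
\begin{aligned}
x_1 &= x_1 \\
x_1 + 3x_2 &= x_2
\end{aligned}
\]
or
\[
x_1 = -2x_2
\]
There are many solutions for $x_1$ and $x_2$.
Setting $x_2 = 1$ (arbitrarily) gives $x_1 = -2$, and hence,
\[
\mathbf{x} = \begin{bmatrix} -2 \\ 1 \end{bmatrix}
\]
is an eigenvector corresponding to the eigenvalue 1. From the second expression,
\[
\begin{aligned}
x_1 &= 3x_1 \\
x_1 + 3x_2 &= 3x_2
\end{aligned}
\]
implies that $x_1 = 0$ and $x_2 = 1$ (arbitrarily), and hence,
\[
\mathbf{x} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}
\]
is an eigenvector corresponding to the eigenvalue 3. It is usual practice to determine
an eigenvector so that it has length unity. That is, if $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$, we take $\mathbf{e} = \mathbf{x}/\sqrt{\mathbf{x}'\mathbf{x}}$
as the eigenvector corresponding to $\lambda$. For example, the eigenvector for $\lambda_1 = 1$ is
$\mathbf{e}_1' = [-2/\sqrt{5},\ 1/\sqrt{5}]$.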
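The eigenvalue and eigenvector calculations for the 2 x 2 example can be checked in code. This is a minimal sketch: for a 2 x 2 matrix the characteristic equation is a quadratic whose coefficients are the trace and determinant, and the helper `apply` (an illustrative name) forms the product Ax so that the defining relation can be confirmed.

```python
import math

A = [[1, 0], [1, 3]]

# For a 2 x 2 matrix, |A - lambda I| = lambda^2 - tr(A) lambda + |A|.
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
disc = math.sqrt(trace ** 2 - 4 * det)
roots = ((trace - disc) / 2, (trace + disc) / 2)

def apply(a, x):
    """Return the matrix-vector product a x for a 2 x 2 matrix a."""
    return [a[0][0] * x[0] + a[0][1] * x[1],
            a[1][0] * x[0] + a[1][1] * x[1]]

x_first = [-2, 1]     # eigenvector found for eigenvalue 1
x_second = [0, 1]     # eigenvector found for eigenvalue 3
e1 = [c / math.sqrt(5) for c in x_first]   # scaled to unit length
print(roots, apply(A, x_first), apply(A, x_second), e1)
```

Here `apply(A, x_first)` returns the vector itself and `apply(A, x_second)` returns three times the vector, which is the eigenvalue-eigenvector relation for eigenvalues 1 and 3.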


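Definition 2A.32. A quadratic form $Q(\mathbf{x})$ in the k variables $x_1, x_2, \ldots, x_k$ is $Q(\mathbf{x}) = \mathbf{x}'\mathbf{A}\mathbf{x}$,
where $\mathbf{x}' = [x_1, x_2, \ldots, x_k]$ and $\mathbf{A}$ is a $k \times k$ symmetric matrix.


Note that a quadratic form can be written as $Q(\mathbf{x}) = \sum_{i=1}^{k}\sum_{j=1}^{k} a_{ij}x_ix_j$. For example,
\[
Q(\mathbf{x}) = [x_1 \ \ x_2]\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = x_1^2 + 2x_1x_2 + x_2^2
\]
\[
Q(\mathbf{x}) = [x_1 \ \ x_2 \ \ x_3]\begin{bmatrix} 1 & 3 & 0 \\ 3 & -1 & -2 \\ 0 & -2 & 2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = x_1^2 + 6x_1x_2 - x_2^2 - 4x_2x_3 + 2x_3^2
\]

Any symmetric square matrix can be reconstructed from its eigenvalues
and eigenvectors. The particular expression reveals the relative importance of
each pair according to the relative size of the eigenvalue and the direction of the
eigenvector.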




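Result 2A.14. The Spectral Decomposition. Let $\mathbf{A}$ be a $k \times k$ symmetric matrix.
Then $\mathbf{A}$ can be expressed in terms of its k eigenvalue-eigenvector pairs $(\lambda_i, \mathbf{e}_i)$ as
\[
\mathbf{A} = \sum_{i=1}^{k} \lambda_i\mathbf{e}_i\mathbf{e}_i'
\]
■

For example, let
\[
\mathbf{A} = \begin{bmatrix} 2.2 & .4 \\ .4 & 2.8 \end{bmatrix}
\]
Then
\[
|\mathbf{A} - \lambda\mathbf{I}| = \lambda^2 - 5\lambda + 6.16 - .16 = (\lambda - 3)(\lambda - 2)
\]
so $\mathbf{A}$ has eigenvalues $\lambda_1 = 3$ and $\lambda_2 = 2$. The corresponding eigenvectors are
$\mathbf{e}_1' = [1/\sqrt{5},\ 2/\sqrt{5}]$ and $\mathbf{e}_2' = [2/\sqrt{5},\ -1/\sqrt{5}]$, respectively. Consequently,
\[
\begin{aligned}
\mathbf{A} = \begin{bmatrix} 2.2 & .4 \\ .4 & 2.8 \end{bmatrix}
&= 3\begin{bmatrix} \frac{1}{\sqrt{5}} \\[4pt] \frac{2}{\sqrt{5}} \end{bmatrix}\begin{bmatrix} \frac{1}{\sqrt{5}} & \frac{2}{\sqrt{5}} \end{bmatrix}
+ 2\begin{bmatrix} \frac{2}{\sqrt{5}} \\[4pt] \frac{-1}{\sqrt{5}} \end{bmatrix}\begin{bmatrix} \frac{2}{\sqrt{5}} & \frac{-1}{\sqrt{5}} \end{bmatrix} \\
&= \begin{bmatrix} .6 & 1.2 \\ 1.2 & 2.4 \end{bmatrix} + \begin{bmatrix} 1.6 & -.8 \\ -.8 & .4 \end{bmatrix}
\end{aligned}
\]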
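The spectral decomposition in this example can be verified by summing the rank-one pieces numerically. This is a minimal sketch in floating point, so the comparison with the original matrix is made up to rounding error; the variable names are illustrative.

```python
import math

# Eigenvalue-eigenvector pairs from the example.
pairs = [(3.0, (1 / math.sqrt(5), 2 / math.sqrt(5))),
         (2.0, (2 / math.sqrt(5), -1 / math.sqrt(5)))]

# Accumulate the sum of lambda_i * e_i e_i'.
A = [[0.0, 0.0], [0.0, 0.0]]
for lam, e in pairs:
    for i in range(2):
        for j in range(2):
            A[i][j] += lam * e[i] * e[j]

target = [[2.2, 0.4], [0.4, 2.8]]
max_error = max(abs(A[i][j] - target[i][j]) for i in range(2) for j in range(2))
print(A, max_error)
```

The reconstructed matrix matches the original to within floating-point rounding.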
The ideas that lead to the spectral decomposition can be extended to provide a 
decomposition for a rectangular, rather than a square, matrix. If A is a rectangular 


matrix, then the vectors in the expansion of A are the eigenvectors of the square 
matrices AA’ and A’A. 


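Result 2A.15. Singular-Value Decomposition. Let $\mathbf{A}$ be an $m \times k$ matrix of real
numbers. Then there exist an $m \times m$ orthogonal matrix $\mathbf{U}$ and a $k \times k$ orthogonal
matrix $\mathbf{V}$ such that
\[
\mathbf{A} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}'
\]
where the $m \times k$ matrix $\boldsymbol{\Lambda}$ has (i, i) entry $\lambda_i \ge 0$ for $i = 1, 2, \ldots, \min(m, k)$ and the
other entries are zero. The positive constants $\lambda_i$ are called the singular values of $\mathbf{A}$. ■


The singular-value decomposition can also be expressed as a matrix expansion
that depends on the rank r of $\mathbf{A}$. Specifically, there exist r positive constants
$\lambda_1, \lambda_2, \ldots, \lambda_r$, r orthogonal $m \times 1$ unit vectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r$, and r orthogonal
$k \times 1$ unit vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_r$, such that
\[
\mathbf{A} = \sum_{i=1}^{r} \lambda_i\mathbf{u}_i\mathbf{v}_i' = \mathbf{U}_r\boldsymbol{\Lambda}_r\mathbf{V}_r'
\]
where $\mathbf{U}_r = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r]$, $\mathbf{V}_r = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_r]$, and $\boldsymbol{\Lambda}_r$ is an $r \times r$ diagonal matrix
with diagonal entries $\lambda_i$.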




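Here $\mathbf{A}\mathbf{A}'$ has eigenvalue-eigenvector pairs $(\lambda_i^2, \mathbf{u}_i)$, so
\[
\mathbf{A}\mathbf{A}'\mathbf{u}_i = \lambda_i^2\mathbf{u}_i
\]
with $\lambda_1^2, \lambda_2^2, \ldots, \lambda_r^2 > 0 = \lambda_{r+1}^2, \lambda_{r+2}^2, \ldots, \lambda_m^2$ (for $m > k$). Then $\mathbf{v}_i = \lambda_i^{-1}\mathbf{A}'\mathbf{u}_i$. Alternatively,
the $\mathbf{v}_i$ are the eigenvectors of $\mathbf{A}'\mathbf{A}$ with the same nonzero eigenvalues $\lambda_i^2$.

The matrix expansion for the singular-value decomposition written in terms of
the full dimensional matrices $\mathbf{U}$, $\mathbf{V}$, $\boldsymbol{\Lambda}$ is
\[
\underset{(m \times k)}{\mathbf{A}} = \underset{(m \times m)}{\mathbf{U}}\,\underset{(m \times k)}{\boldsymbol{\Lambda}}\,\underset{(k \times k)}{\mathbf{V}'}
\]
where $\mathbf{U}$ has m orthogonal eigenvectors of $\mathbf{A}\mathbf{A}'$ as its columns, $\mathbf{V}$ has k orthogonal
eigenvectors of $\mathbf{A}'\mathbf{A}$ as its columns, and $\boldsymbol{\Lambda}$ is specified in Result 2A.15.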


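For example, let
\[
\mathbf{A} = \begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix}
\]
Then
\[
\mathbf{A}\mathbf{A}' = \begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix}\begin{bmatrix} 3 & -1 \\ 1 & 3 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 11 & 1 \\ 1 & 11 \end{bmatrix}
\]
You may verify that the eigenvalues $\gamma = \lambda^2$ of $\mathbf{A}\mathbf{A}'$ satisfy the equation
$\gamma^2 - 22\gamma + 120 = (\gamma - 12)(\gamma - 10) = 0$, and consequently, the eigenvalues are
$\gamma_1 = \lambda_1^2 = 12$ and $\gamma_2 = \lambda_2^2 = 10$. The corresponding eigenvectors are
$\mathbf{u}_1' = \left[\frac{1}{\sqrt{2}} \ \ \frac{1}{\sqrt{2}}\right]$ and $\mathbf{u}_2' = \left[\frac{1}{\sqrt{2}} \ \ \frac{-1}{\sqrt{2}}\right]$, respectively.

Also,
\[
\mathbf{A}'\mathbf{A} = \begin{bmatrix} 3 & -1 \\ 1 & 3 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix} = \begin{bmatrix} 10 & 0 & 2 \\ 0 & 10 & 4 \\ 2 & 4 & 2 \end{bmatrix}
\]
so $|\mathbf{A}'\mathbf{A} - \gamma\mathbf{I}| = -\gamma^3 + 22\gamma^2 - 120\gamma = -\gamma(\gamma - 12)(\gamma - 10)$, and the eigenvalues
are $\gamma_1 = \lambda_1^2 = 12$, $\gamma_2 = \lambda_2^2 = 10$, and $\gamma_3 = \lambda_3^2 = 0$. The nonzero eigenvalues are the
same as those of $\mathbf{A}\mathbf{A}'$. A computer calculation gives the eigenvectors
\[
\mathbf{v}_1' = \left[\tfrac{1}{\sqrt{6}} \ \ \tfrac{2}{\sqrt{6}} \ \ \tfrac{1}{\sqrt{6}}\right], \quad
\mathbf{v}_2' = \left[\tfrac{2}{\sqrt{5}} \ \ \tfrac{-1}{\sqrt{5}} \ \ 0\right], \quad \text{and} \quad
\mathbf{v}_3' = \left[\tfrac{1}{\sqrt{30}} \ \ \tfrac{2}{\sqrt{30}} \ \ \tfrac{-5}{\sqrt{30}}\right]
\]
Eigenvectors $\mathbf{v}_1$ and $\mathbf{v}_2$ can be verified by checking:
\[
\mathbf{A}'\mathbf{A}\mathbf{v}_1 = \begin{bmatrix} 10 & 0 & 2 \\ 0 & 10 & 4 \\ 2 & 4 & 2 \end{bmatrix}\frac{1}{\sqrt{6}}\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} = 12\,\frac{1}{\sqrt{6}}\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} = \lambda_1^2\mathbf{v}_1
\]
\[
\mathbf{A}'\mathbf{A}\mathbf{v}_2 = \begin{bmatrix} 10 & 0 & 2 \\ 0 & 10 & 4 \\ 2 & 4 & 2 \end{bmatrix}\frac{1}{\sqrt{5}}\begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix} = 10\,\frac{1}{\sqrt{5}}\begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix} = \lambda_2^2\mathbf{v}_2
\]
Taking $\lambda_1 = \sqrt{12}$ and $\lambda_2 = \sqrt{10}$, we find that the singular-value decomposition of
$\mathbf{A}$ is
\[
\mathbf{A} = \begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix}
= \sqrt{12}\begin{bmatrix} \frac{1}{\sqrt{2}} \\[4pt] \frac{1}{\sqrt{2}} \end{bmatrix}\begin{bmatrix} \frac{1}{\sqrt{6}} & \frac{2}{\sqrt{6}} & \frac{1}{\sqrt{6}} \end{bmatrix}
+ \sqrt{10}\begin{bmatrix} \frac{1}{\sqrt{2}} \\[4pt] \frac{-1}{\sqrt{2}} \end{bmatrix}\begin{bmatrix} \frac{2}{\sqrt{5}} & \frac{-1}{\sqrt{5}} & 0 \end{bmatrix}
\]
The equality may be checked by carrying out the operations on the right-hand side.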
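The check suggested above can be carried out numerically. The following minimal sketch sums the two rank-one terms of the expansion in floating point and compares the result with the original 2 x 3 matrix; the variable names are illustrative.

```python
import math

s2, s5, s6 = math.sqrt(2), math.sqrt(5), math.sqrt(6)

# (singular value, u_i, v_i) triples from the example.
terms = [(math.sqrt(12), (1 / s2, 1 / s2), (1 / s6, 2 / s6, 1 / s6)),
         (math.sqrt(10), (1 / s2, -1 / s2), (2 / s5, -1 / s5, 0.0))]

# Sum of lambda_i * u_i v_i' gives a 2 x 3 matrix.
A = [[sum(lam * u[i] * v[j] for lam, u, v in terms) for j in range(3)]
     for i in range(2)]

target = [[3, 1, 1], [-1, 3, 1]]
max_error = max(abs(A[i][j] - target[i][j]) for i in range(2) for j in range(3))
print(A, max_error)
```

Each term contributes an exact integer matrix here, because the square roots cancel, so the sum reproduces the original matrix up to rounding.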

The singular-value decomposition is closely connected to a result concerning
the approximation of a rectangular matrix by a lower-dimensional matrix, due to
Eckart and Young ([2]). If an $m \times k$ matrix $\mathbf{A}$ is approximated by $\mathbf{B}$, having the same
dimension but lower rank, the sum of squared differences is
\[
\sum_{i=1}^{m}\sum_{j=1}^{k} (a_{ij} - b_{ij})^2 = \operatorname{tr}[(\mathbf{A} - \mathbf{B})(\mathbf{A} - \mathbf{B})']
\]


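Result 2A.16. Let $\mathbf{A}$ be an $m \times k$ matrix of real numbers with $m \ge k$ and singular
value decomposition $\mathbf{U}\boldsymbol{\Lambda}\mathbf{V}'$. Let $s < k = \operatorname{rank}(\mathbf{A})$. Then
\[
\mathbf{B} = \sum_{i=1}^{s} \lambda_i\mathbf{u}_i\mathbf{v}_i'
\]
is the rank-s least squares approximation to $\mathbf{A}$. It minimizes
\[
\operatorname{tr}[(\mathbf{A} - \mathbf{B})(\mathbf{A} - \mathbf{B})']
\]
over all $m \times k$ matrices $\mathbf{B}$ having rank no greater than s. The minimum value, or
error of approximation, is $\sum_{i=s+1}^{k} \lambda_i^2$. ■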


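To establish this result, we use $\mathbf{U}\mathbf{U}' = \mathbf{I}_m$ and $\mathbf{V}\mathbf{V}' = \mathbf{I}_k$ to write the sum of
squares as

\[
\begin{aligned}
\operatorname{tr}[(\mathbf{A} - \mathbf{B})(\mathbf{A} - \mathbf{B})']
&= \operatorname{tr}[\mathbf{U}\mathbf{U}'(\mathbf{A} - \mathbf{B})\mathbf{V}\mathbf{V}'(\mathbf{A} - \mathbf{B})'] \\
&= \operatorname{tr}[\mathbf{U}'(\mathbf{A} - \mathbf{B})\mathbf{V}\mathbf{V}'(\mathbf{A} - \mathbf{B})'\mathbf{U}] \\
&= \operatorname{tr}[(\boldsymbol{\Lambda} - \mathbf{C})(\boldsymbol{\Lambda} - \mathbf{C})']
 = \sum_{i=1}^{m}\sum_{j=1}^{k} (\lambda_{ij} - c_{ij})^2
 = \sum_{i=1}^{k} (\lambda_i - c_{ii})^2 + \mathop{\sum\sum}_{i \ne j} c_{ij}^2
\end{aligned}
\]

where $\mathbf{C} = \mathbf{U}'\mathbf{B}\mathbf{V}$. Clearly, the minimum occurs when $c_{ij} = 0$ for $i \ne j$ and $c_{ii} = \lambda_i$ for
the s largest singular values. The other $c_{ii} = 0$. That is, $\mathbf{U}'\mathbf{B}\mathbf{V} = \boldsymbol{\Lambda}_s$, where $\boldsymbol{\Lambda}_s$ retains
only the s largest singular values of $\boldsymbol{\Lambda}$, or $\mathbf{B} = \sum_{i=1}^{s} \lambda_i \mathbf{u}_i \mathbf{v}_i'$.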
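Result 2A.16 can be illustrated numerically. The sketch below is an assumption-laden example, not part of the text: it uses NumPy, a hypothetical helper named `rank_s_approximation`, and the transpose of the example matrix so that the condition $m \ge k$ holds. With $s = 1$, the error of approximation should equal the discarded squared singular value.

```python
import numpy as np

def rank_s_approximation(A, s):
    """Return the rank-s least squares approximation B, its error, and the singular values."""
    U, lam, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the s largest singular values and their singular vectors.
    B = (U[:, :s] * lam[:s]) @ Vt[:s, :]
    # Sum of squared differences, written as a trace as in the text.
    error = np.trace((A - B) @ (A - B).T)
    return B, error, lam

# Transpose of the example matrix: m = 3 rows, k = 2 columns, rank 2.
A = np.array([[3.0, -1.0],
              [1.0, 3.0],
              [1.0, 1.0]])

B, error, lam = rank_s_approximation(A, 1)
print(B)
print(error, np.sum(lam[1:] ** 2))
```

The two printed numbers agree, which is the statement that the minimum error is the sum of the squared singular values left out of the approximation.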


Exercises 




2.1. 


2.2. 


2.3. 


2.4. 


2.5. 


2.6. 


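Let $\mathbf{x}' = [5, 1, 3]$ and $\mathbf{y}' = [-1, 3, 1]$.

(a) Graph the two vectors.

(b) Find (i) the length of $\mathbf{x}$, (ii) the angle between $\mathbf{x}$ and $\mathbf{y}$, and (iii) the projection of $\mathbf{y}$ on $\mathbf{x}$.

(c) Since $\bar{x} = 3$ and $\bar{y} = 1$, graph $[5 - 3, 1 - 3, 3 - 3] = [2, -2, 0]$ and
$[-1 - 1, 3 - 1, 1 - 1] = [-2, 2, 0]$.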


Given the matrices 


py 4 -3 5 
rela B= 1 -2), and C=} -4 
-2 0 2 
perform the indicated multiplications. 
(a) 5A 
(b) BA 
(c) A’B’ 
(d) C’B 


(e) Is AB defined? 


Verify the following properties of the transpose when 


21 142 1 4 
a-(? Hi B=|! 0 a and cx} | 


(a) (A')'=A 

(b) (C’)* = (C7)' 

(c) (AB)’ = B'A’ 

(d) For general A and B ,(AB)’ = B’A’. 
(mxk) (kx @) 


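When $\mathbf{A}^{-1}$ and $\mathbf{B}^{-1}$ exist, prove each of the following.

(a) $(\mathbf{A}')^{-1} = (\mathbf{A}^{-1})'$

(b) $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$

Hint: Part a can be proved by noting that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$, $\mathbf{I} = \mathbf{I}'$, and $(\mathbf{A}\mathbf{A}^{-1})' = (\mathbf{A}^{-1})'\mathbf{A}'$.
Part b follows from $(\mathbf{B}^{-1}\mathbf{A}^{-1})\mathbf{A}\mathbf{B} = \mathbf{B}^{-1}(\mathbf{A}^{-1}\mathbf{A})\mathbf{B} = \mathbf{B}^{-1}\mathbf{B} = \mathbf{I}$.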


Check that 
Ss 2 
13° 13 
& 8) 
13°13 
is an orthogonal matrix. 


Let 


(a) Is A symmetric? 
(b) Show that A is positive definite. 


—2,0] and 


= (ATA' 




' 2.7. Let A be as given in Exercise 2.6. 


2.8. 


2.9. 


2.10. 


. 


2.12, 


2.13. 


2.14. 


2.15. 


2.16. 


(a) Determine the eigenvalues and eigenvectors of $\mathbf{A}$.
(b) Write the spectral decomposition of $\mathbf{A}$.

(c) Find $\mathbf{A}^{-1}$.

(d) Find the eigenvalues and eigenvectors of $\mathbf{A}^{-1}$.


“La 


find the eigenvalues $\lambda_1$ and $\lambda_2$ and the associated normalized eigenvectors $\mathbf{e}_1$ and $\mathbf{e}_2$.
Determine the spectral decomposition (2-16) of $\mathbf{A}$.

Let $\mathbf{A}$ be as in Exercise 2.8.

(a) Find $\mathbf{A}^{-1}$.

(b) Compute the eigenvalues and eigenvectors of $\mathbf{A}^{-1}$.

(c) Write the spectral decomposition of $\mathbf{A}^{-1}$, and compare it with that of $\mathbf{A}$ from
Exercise 2.8.


Given the matrix 


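Consider the matrices

\[
\mathbf{A} = \begin{bmatrix} 4 & 4.001 \\ 4.001 & 4.002 \end{bmatrix}
\quad\text{and}\quad
\mathbf{B} = \begin{bmatrix} 4 & 4.001 \\ 4.001 & 4.002001 \end{bmatrix}
\]

These matrices are identical except for a small difference in the (2,2) position.
Moreover, the columns of $\mathbf{A}$ (and $\mathbf{B}$) are nearly linearly dependent. Show that
$\mathbf{A}^{-1} \doteq (-3)\mathbf{B}^{-1}$. Consequently, small changes, perhaps caused by rounding, can give
substantially different inverses.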

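Show that the determinant of the $p \times p$ diagonal matrix $\mathbf{A} = \{a_{ij}\}$ with $a_{ij} = 0$, $i \ne j$,
is given by the product of the diagonal elements; thus, $|\mathbf{A}| = a_{11}a_{22}\cdots a_{pp}$.

Hint: By Definition 2A.24, $|\mathbf{A}| = a_{11}|\mathbf{A}_{11}| + 0 + \cdots + 0$. Repeat for the submatrix
$\mathbf{A}_{11}$ obtained by deleting the first row and first column of $\mathbf{A}$.

Show that the determinant of a square symmetric $p \times p$ matrix $\mathbf{A}$ can be expressed as
the product of its eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$; that is, $|\mathbf{A}| = \prod_{i=1}^{p} \lambda_i$.

Hint: From (2-16) and (2-20), $\mathbf{A} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'$ with $\mathbf{P}'\mathbf{P} = \mathbf{I}$. From Result 2A.11(e),
$|\mathbf{A}| = |\mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'| = |\mathbf{P}||\boldsymbol{\Lambda}\mathbf{P}'| = |\mathbf{P}||\boldsymbol{\Lambda}||\mathbf{P}'| = |\boldsymbol{\Lambda}||\mathbf{I}|$, since $|\mathbf{I}| = |\mathbf{P}'\mathbf{P}| = |\mathbf{P}'||\mathbf{P}|$. Apply
Exercise 2.11.

Show that $|\mathbf{Q}| = +1$ or $-1$ if $\mathbf{Q}$ is a $p \times p$ orthogonal matrix.

Hint: $|\mathbf{Q}\mathbf{Q}'| = |\mathbf{I}|$. Also, from Result 2A.11, $|\mathbf{Q}||\mathbf{Q}'| = |\mathbf{Q}|^2$. Thus, $|\mathbf{Q}|^2 = |\mathbf{I}|$. Now
use Exercise 2.11.

Show that $\mathbf{Q}'\mathbf{A}\mathbf{Q}$ and $\mathbf{A}$ have the same eigenvalues if $\mathbf{Q}$ is orthogonal. (Here $\mathbf{Q}$ and $\mathbf{A}$
are both $p \times p$.)

Hint: Let $\lambda$ be an eigenvalue of $\mathbf{A}$. Then $0 = |\mathbf{A} - \lambda\mathbf{I}|$. By Exercise 2.13 and Result
2A.11(e), we can write $0 = |\mathbf{Q}'||\mathbf{A} - \lambda\mathbf{I}||\mathbf{Q}| = |\mathbf{Q}'\mathbf{A}\mathbf{Q} - \lambda\mathbf{I}|$, since $\mathbf{Q}'\mathbf{Q} = \mathbf{I}$.

A quadratic form $\mathbf{x}'\mathbf{A}\mathbf{x}$ is said to be positive definite if the matrix $\mathbf{A}$ is positive definite.
Is the quadratic form $3x_1^2 + 3x_2^2 - 2x_1x_2$ positive definite?

Consider an arbitrary $n \times p$ matrix $\mathbf{A}$. Then $\mathbf{A}'\mathbf{A}$ is a symmetric $p \times p$ matrix. Show
that $\mathbf{A}'\mathbf{A}$ is necessarily nonnegative definite.
Hint: Set $\mathbf{y} = \mathbf{A}\mathbf{x}$ so that $\mathbf{y}'\mathbf{y} = \mathbf{x}'\mathbf{A}'\mathbf{A}\mathbf{x}$.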


2.17. 


2.18. 


2.19. 


2.20. 


2.21. 


2.22. 


2.23. 


2.24. 




Prove that every eigenvalue of a $k \times k$ positive definite matrix $\mathbf{A}$ is positive.
Hint: Consider the definition of an eigenvalue, where $\mathbf{A}\mathbf{e} = \lambda\mathbf{e}$. Multiply on the left by
$\mathbf{e}'$ so that $\mathbf{e}'\mathbf{A}\mathbf{e} = \lambda\mathbf{e}'\mathbf{e}$.

Consider the sets of points $(x_1, x_2)$ whose "distances" from the origin are given by

\[
c^2 = 4x_1^2 + 3x_2^2 - 2\sqrt{2}\,x_1x_2
\]

for $c^2 = 1$ and for $c^2 = 4$. Determine the major and minor axes of the ellipses of constant
distances and their associated lengths. Sketch the ellipses of constant distances and
comment on their positions. What will happen as $c^2$ increases?


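Let $\mathbf{A}^{1/2} = \sum_{i=1}^{m} \sqrt{\lambda_i}\,\mathbf{e}_i\mathbf{e}_i' = \mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}'$, where $\mathbf{A}^{1/2}$ is $m \times m$ and
$\mathbf{P}\mathbf{P}' = \mathbf{P}'\mathbf{P} = \mathbf{I}$. (The $\lambda_i$'s and the $\mathbf{e}_i$'s are
the eigenvalues and associated normalized eigenvectors of the matrix $\mathbf{A}$.) Show Properties
(1)-(4) of the square-root matrix in (2-22).

Determine the square-root matrix $\mathbf{A}^{1/2}$, using the matrix $\mathbf{A}$ in Exercise 2.3. Also,
determine $\mathbf{A}^{-1/2}$, and show that $\mathbf{A}^{1/2}\mathbf{A}^{-1/2} = \mathbf{A}^{-1/2}\mathbf{A}^{1/2} = \mathbf{I}$.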


(See Result 2A.15) Using the matrix 


1 1 
A=j)2 ~2 
ar 


(a) Calculate A’A and obtain its eigenvalues and eigenvectors, 


(b) Calculate AA’ and obtain its eigenvalues and eigenvectors. Check that the nonzero 
eigenvalues are the same as those in part a. 


(c) Obtain the singular-value decomposition of A. 
(See Result 2A.15) Using the matrix 


48 8 
a-( 6 | 


(a) Calculate AA’ and obtain its eigenvalues and eigenvectors. 

(b) Calculate A’A and obtain its eigenvalues and eigenvectors. Check that the nonzero 
eigenvalues are the same as those in part a. 

(c) Obtain the singular-value decomposition of A. 

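Verify the relationships $\mathbf{V}^{1/2}\boldsymbol{\rho}\mathbf{V}^{1/2} = \boldsymbol{\Sigma}$ and $\boldsymbol{\rho} = (\mathbf{V}^{1/2})^{-1}\boldsymbol{\Sigma}(\mathbf{V}^{1/2})^{-1}$, where $\boldsymbol{\Sigma}$ is the
$p \times p$ population covariance matrix [Equation (2-32)], $\boldsymbol{\rho}$ is the $p \times p$ population
correlation matrix [Equation (2-34)], and $\mathbf{V}^{1/2}$ is the population standard deviation matrix
[Equation (2-35)].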


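Let $\mathbf{X}$ have covariance matrix

\[
\boldsymbol{\Sigma} = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]

Find
(a) $\boldsymbol{\Sigma}^{-1}$
(b) The eigenvalues and eigenvectors of $\boldsymbol{\Sigma}$.
(c) The eigenvalues and eigenvectors of $\boldsymbol{\Sigma}^{-1}$.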




2.25. Let X have covariance matrix 


oF 2 
E=/+-2 41 
4 19 


(a) Determine p and V2, 
(b) Multiply your matrices to check the relation V'"pv'/ = . 
2.26. Use = as given in Exercise 2.25. 
(a) Find P13- 
(b) Find the correlation between x, and 5 LX, + 1X. 3- 
2.27. Derive expressions for the mean and variances of the following linear combinations in 
terms of the means and covariances of the random variables X,, X2, and X3. 
(a) X, — 2X2 
(b) ~X, + 3X2 
(c) x, + X2 + X3 
(e) xX, + 2X2 ors X3 
(f) 3X, — 4X, if X, and X2 are independent random variables. 
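2.28. Show that

\[
\operatorname{Cov}(c_{11}X_1 + c_{12}X_2 + \cdots + c_{1p}X_p,\; c_{21}X_1 + c_{22}X_2 + \cdots + c_{2p}X_p) = \mathbf{c}_1'\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{c}_2
\]

where $\mathbf{c}_1' = [c_{11}, c_{12}, \ldots, c_{1p}]$ and $\mathbf{c}_2' = [c_{21}, c_{22}, \ldots, c_{2p}]$. This verifies the off-diagonal
elements of $\mathbf{C}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{C}'$ in (2-45) or diagonal elements if $\mathbf{c}_1 = \mathbf{c}_2$.

Hint: By (2-43), $Z_1 - E(Z_1) = c_{11}(X_1 - \mu_1) + \cdots + c_{1p}(X_p - \mu_p)$ and
$Z_2 - E(Z_2) = c_{21}(X_1 - \mu_1) + \cdots + c_{2p}(X_p - \mu_p)$. So $\operatorname{Cov}(Z_1, Z_2) =
E[(Z_1 - E(Z_1))(Z_2 - E(Z_2))] = E[(c_{11}(X_1 - \mu_1) + \cdots + c_{1p}(X_p - \mu_p))
(c_{21}(X_1 - \mu_1) + c_{22}(X_2 - \mu_2) + \cdots + c_{2p}(X_p - \mu_p))]$.

The product

\[
\begin{aligned}
&(c_{11}(X_1 - \mu_1) + c_{12}(X_2 - \mu_2) + \cdots + c_{1p}(X_p - \mu_p))
 (c_{21}(X_1 - \mu_1) + c_{22}(X_2 - \mu_2) + \cdots + c_{2p}(X_p - \mu_p)) \\
&\qquad = \left(\sum_{\ell=1}^{p} c_{1\ell}(X_\ell - \mu_\ell)\right)\left(\sum_{m=1}^{p} c_{2m}(X_m - \mu_m)\right)
 = \sum_{\ell=1}^{p}\sum_{m=1}^{p} c_{1\ell}c_{2m}(X_\ell - \mu_\ell)(X_m - \mu_m)
\end{aligned}
\]

has expected value

\[
\sum_{\ell=1}^{p}\sum_{m=1}^{p} c_{1\ell}c_{2m}\sigma_{\ell m} = [c_{11}, \ldots, c_{1p}]\,\boldsymbol{\Sigma}\,[c_{21}, \ldots, c_{2p}]'
\]

Verify the last step by the definition of matrix multiplication. The same steps hold for all
elements.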


2.29. 


2.30. 


2.31. 




Consider the arbitrary random vector X’ = [X), X2, X3, X,, X5] with mean vector 
#' = [M1, M2. M3, M4, Ms]. Partition X into 


where 


x) = BI and X@=| x, 
2 X, 
Let & be the covariance matrix of X with general element o,,. Partition £ into the 


covariance matrices of X!) and X(?) and the covariance matrix of an element of X“) 
and an element of X (2). 


You are given the random vector X’ = [X1, X2,X3, X4] with mean vector 
#x = (4,3, 2, 1] and variance-covariance matrix 


30 2 2 
01 1 #0 
2x=lo 1 9 -2 
20-2 4 
Partition X as 
x, 
_ | *2 x@ 
mts Be 
X, 


Let 
1 -2 
A=[1 2] and B~|) | 


and consider the linear combinations AX") and BX). Find 
(a) E(x) 

(b) E(AX")) 

(c) Cov(X")) 

(d) Cov(AX)) 

(e) E(X) 

(f) E(BX) 

(g) Cov(X) 

(h) Cov (BX) 

(i) Cov(X™, X@)) 

(j) Cov(AX”), BK®@) 

Repeat Exercise 2.30, but with A and B replaced by 


Ae Ai) aa B-(? a 




2.32. You are given the random vector X’ = [X,,X_,-.-,X5] with mean vector 


2.33. 


2.34. 


py = (2.4, —1, 3, 0] and variance-covariance matrix 


4-1 } =} 0 
~1 3 1-1 #0 
7 See) Sd) 6 Ad 
1 
5 -1 1 4 
0 0-1 0 2 
Partition X as 
xX, 
X) (1) 
x=|3¢|=|25 
3 x? 
X4 
Xs 
Let 


1 -1 11 1 
aa[} ot] ana e=[t | 


and consider the linear combinations AX") and BX). Find 
(a) E(x") 

(b) E(AX") 

(c) Cov(X")) 

(d) Cov(AX")) 

(e) E(X) 

(f) E(BX”) 

(g) Cov(X)) 

(h) Cov(BX")) 

(i) Cov(x), X) 
(j) Cov(AX"), BX) 


Repeat Exercise 2.32, but with X partitioned as 
x; 
x2] _ | xX 
=) % Ea 
xX 
Xs 


and with A and B replaced by 


2 -1 0 1 2 
aes 1 A and B= |} ol 
Consider the vectors $\mathbf{b}' = [2, -1, 4, 0]$ and $\mathbf{d}' = [-1, 3, -2, 1]$. Verify the Cauchy-Schwarz
inequality $(\mathbf{b}'\mathbf{d})^2 \le (\mathbf{b}'\mathbf{b})(\mathbf{d}'\mathbf{d})$.


2.35. 


2.36. 


2.37. 
2.38. 


2.39. 


2.40. 


2.41. 




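Using the vectors $\mathbf{b}' = [4, 3]$ and $\mathbf{d}' = [1, 1]$, verify the extended Cauchy-Schwarz
inequality $(\mathbf{b}'\mathbf{d})^2 \le (\mathbf{b}'\mathbf{B}\mathbf{b})(\mathbf{d}'\mathbf{B}^{-1}\mathbf{d})$ if

\[
\mathbf{B} = \begin{bmatrix} 2 & -2 \\ -2 & 5 \end{bmatrix}
\]

Find the maximum and minimum values of the quadratic form $4x_1^2 + 4x_2^2 + 6x_1x_2$ for
all points $\mathbf{x}' = [x_1, x_2]$ such that $\mathbf{x}'\mathbf{x} = 1$.

With $\mathbf{A}$ as given in Exercise 2.6, find the maximum value of $\mathbf{x}'\mathbf{A}\mathbf{x}$ for $\mathbf{x}'\mathbf{x} = 1$.

Find the maximum and minimum values of the ratio $\mathbf{x}'\mathbf{A}\mathbf{x}/\mathbf{x}'\mathbf{x}$ for any nonzero vectors
$\mathbf{x}' = [x_1, x_2, x_3]$ if

\[
\mathbf{A} = \begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix}
\]

Show that the product $\mathbf{A}\mathbf{B}\mathbf{C}$, with $\mathbf{A}$ of dimension $r \times s$, $\mathbf{B}$ of dimension $s \times t$, and $\mathbf{C}$ of
dimension $t \times v$, has $(i, j)$th entry $\sum_{\ell=1}^{s}\sum_{k=1}^{t} a_{i\ell}b_{\ell k}c_{kj}$.

Hint: $\mathbf{B}\mathbf{C}$ has $(\ell, j)$th entry $\sum_{k=1}^{t} b_{\ell k}c_{kj} = d_{\ell j}$. So $\mathbf{A}(\mathbf{B}\mathbf{C})$ has $(i, j)$th element

\[
a_{i1}d_{1j} + a_{i2}d_{2j} + \cdots + a_{is}d_{sj}
= \sum_{\ell=1}^{s} a_{i\ell}\left(\sum_{k=1}^{t} b_{\ell k}c_{kj}\right)
= \sum_{\ell=1}^{s}\sum_{k=1}^{t} a_{i\ell}b_{\ell k}c_{kj}
\]

Verify (2-24): $E(\mathbf{X} + \mathbf{Y}) = E(\mathbf{X}) + E(\mathbf{Y})$ and $E(\mathbf{A}\mathbf{X}\mathbf{B}) = \mathbf{A}E(\mathbf{X})\mathbf{B}$.
Hint: $\mathbf{X} + \mathbf{Y}$ has $X_{ij} + Y_{ij}$ as its $(i, j)$th element. Now, $E(X_{ij} + Y_{ij}) = E(X_{ij}) + E(Y_{ij})$
by a univariate property of expectation, and this last quantity is the $(i, j)$th element of
$E(\mathbf{X}) + E(\mathbf{Y})$. Next (see Exercise 2.39), $\mathbf{A}\mathbf{X}\mathbf{B}$ has $(i, j)$th entry $\sum_{\ell}\sum_{k} a_{i\ell}X_{\ell k}b_{kj}$, and
by the additive property of expectation,

\[
E\left(\sum_{\ell}\sum_{k} a_{i\ell}X_{\ell k}b_{kj}\right) = \sum_{\ell}\sum_{k} a_{i\ell}E(X_{\ell k})b_{kj}
\]

which is the $(i, j)$th element of $\mathbf{A}E(\mathbf{X})\mathbf{B}$.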


You are given the random vector X’ = [X,, X2,X3, X,4] with mean vector 
# x = [3, 2, —2, 0] and variance—covariance matrix 
3 000 
03 00 
*x=/9 9 3.0 
00 0 3 
Let 
1-1 oO O 


(a) Find E (AX), the mean of AX. 
(b) Find Cov (AX), the variances and covariances of AX. 
(c) Which pairs of linear combinations have zero covariances? 




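2.42. Repeat Exercise 2.41, but with

\[
\boldsymbol{\Sigma}_{\mathbf{X}} = \begin{bmatrix} 3 & 1 & 1 & 1 \\ 1 & 3 & 1 & 1 \\ 1 & 1 & 3 & 1 \\ 1 & 1 & 1 & 3 \end{bmatrix}
\]
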
References

1. Bellman, R. Introduction to Matrix Analysis (2nd ed.). Philadelphia: Soc for Industrial &
   Applied Math (SIAM), 1997.
2. Eckart, C., and G. Young. "The Approximation of One Matrix by Another of Lower
   Rank." Psychometrika, 1 (1936), 211-218.
3. Graybill, F. A. Introduction to Matrices with Applications in Statistics. Belmont, CA:
   Wadsworth, 1969.
4. Halmos, P. R. Finite-Dimensional Vector Spaces. New York: Springer-Verlag, 1993.
5. Johnson, R. A., and G. K. Bhattacharyya. Statistics: Principles and Methods (5th ed.). New
   York: John Wiley, 2005.
6. Noble, B., and J. W. Daniel. Applied Linear Algebra (3rd ed.). Englewood Cliffs, NJ:
   Prentice Hall, 1988.


Chapter 


SAMPLE GEOMETRY 
AND RANDOM SAMPLING 


3.1 Introduction 


With the vector concepts introduced in the previous chapter, we can now delve deeper
into the geometrical interpretations of the descriptive statistics $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$; we do so in
Section 3.2. Many of our explanations use the representation of the columns of $\mathbf{X}$ as p
vectors in n dimensions. In Section 3.3 we introduce the assumption that the observations
constitute a random sample. Simply stated, random sampling implies that (1) measurements
taken on different items (or trials) are unrelated to one another and (2) the
joint distribution of all p variables remains the same for all items. Ultimately, it is this
structure of the random sample that justifies a particular choice of distance and dictates
the geometry for the n-dimensional representation of the data. Furthermore, when data
can be treated as a random sample, statistical inferences are based on a solid foundation.

Returning to geometric interpretations in Section 3.4, we introduce a single
number, called generalized variance, to describe variability. This generalization of
variance is an integral part of the comparison of multivariate means. In later sections
we use matrix algebra to provide concise expressions for the matrix products
and sums that allow us to calculate $\bar{\mathbf{x}}$ and $\mathbf{S}_n$ directly from the data matrix $\mathbf{X}$. The
connection between $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and the means and covariances for linear combinations
of variables is also clearly delineated, using the notion of matrix products.


3.2 The Geometry of the Sample 


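A single multivariate observation is the collection of measurements on p different
variables taken on the same item or trial. As in Chapter 1, if n observations have
been obtained, the entire data set can be placed in an $n \times p$ array (matrix):

\[
\underset{(n\times p)}{\mathbf{X}} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
\]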



Each row of $\mathbf{X}$ represents a multivariate observation. Since the entire set of
measurements is often one particular realization of what might have been
observed, we say that the data are a sample of size n from a p-variate
"population." The sample then consists of n measurements, each of which has p
components.

As we have seen, the data can be plotted in two different ways. For the
p-dimensional scatter plot, the rows of $\mathbf{X}$ represent n points in p-dimensional
space. We can write

\[
\underset{(n\times p)}{\mathbf{X}} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
= \begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{bmatrix}
\begin{matrix} \leftarrow \text{1st (multivariate) observation} \\ \\ \\ \leftarrow n\text{th (multivariate) observation} \end{matrix}
\]

The row vector $\mathbf{x}_j'$, representing the jth observation, contains the coordinates of a
point.

The scatter plot of points in p-dimensional space provides information on the
locations and variability of the points. If the points are regarded as solid spheres,
the sample mean vector $\bar{\mathbf{x}}$, given by (1-8), is the center of balance. Variability occurs
in more than one direction, and it is quantified by the sample variance-covariance
matrix $\mathbf{S}_n$. A single numerical measure of variability is provided by the determinant
of the sample variance-covariance matrix. When p is greater than 3, this scatter
plot representation cannot actually be graphed. Yet the consideration of the data
as n points in p dimensions provides insights that are not readily available from
algebraic expressions. Moreover, the concepts illustrated for p = 2 or p = 3 remain
valid for the other cases.


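Example 3.1 (Computing the mean vector) Compute the mean vector $\bar{\mathbf{x}}$ from the
data matrix

\[
\mathbf{X} = \begin{bmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{bmatrix}
\]

Plot the n = 3 data points in p = 2 space, and locate $\bar{\mathbf{x}}$ on the resulting diagram.
The first point, $\mathbf{x}_1$, has coordinates $\mathbf{x}_1' = [4, 1]$. Similarly, the remaining two
points are $\mathbf{x}_2' = [-1, 3]$ and $\mathbf{x}_3' = [3, 5]$. Finally,

\[
\bar{\mathbf{x}} = \begin{bmatrix} \dfrac{4 - 1 + 3}{3} \\[2mm] \dfrac{1 + 3 + 5}{3} \end{bmatrix}
= \begin{bmatrix} 2 \\ 3 \end{bmatrix}
\]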




Figure 3.1 A plot of the data matrix $\mathbf{X}$ as n = 3 points in p = 2 space.

Figure 3.1 shows that $\bar{\mathbf{x}}$ is the balance point (center of gravity) of the scatter
plot. ■


The alternative geometrical representation is constructed by considering the
data as p vectors in n-dimensional space. Here we take the elements of the columns
of the data matrix to be the coordinates of the vectors. Let

\[
\underset{(n\times p)}{\mathbf{X}} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
= [\mathbf{y}_1 \mid \mathbf{y}_2 \mid \cdots \mid \mathbf{y}_p]
\tag{3-2}
\]

Then the coordinates of the first point $\mathbf{y}_1' = [x_{11}, x_{21}, \ldots, x_{n1}]$ are the n measurements
on the first variable. In general, the ith point $\mathbf{y}_i' = [x_{1i}, x_{2i}, \ldots, x_{ni}]$ is
determined by the n-tuple of all measurements on the ith variable. In this geometrical
representation, we depict $\mathbf{y}_1, \ldots, \mathbf{y}_p$ as vectors rather than points, as in the
p-dimensional scatter plot. We shall be manipulating these quantities shortly using
the algebra of vectors discussed in Chapter 2.


Example 3.2 (Data as p vectors in n dimensions) Plot the following data as p = 2
vectors in n = 3 space:

\[
\mathbf{X} = \begin{bmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{bmatrix}
\]




Figure 3.2 A plot of the data matrix $\mathbf{X}$ as p = 2 vectors in n = 3 space.

Here $\mathbf{y}_1' = [4, -1, 3]$ and $\mathbf{y}_2' = [1, 3, 5]$. These vectors are shown in Figure 3.2. ■


Many of the algebraic expressions we shall encounter in multivariate analysis
can be related to the geometrical notions of length, angle, and volume. This is
important because geometrical representations ordinarily facilitate understanding and
lead to further insights.

Unfortunately, we are limited to visualizing objects in three dimensions, and
consequently, the n-dimensional representation of the data matrix $\mathbf{X}$ may not seem
like a particularly useful device for n > 3. It turns out, however, that geometrical
relationships and the associated statistical concepts depicted for any three vectors
remain valid regardless of their dimension. This follows because three vectors, even if
n dimensional, can span no more than a three-dimensional space, just as two vectors
with any number of components must lie in a plane. By selecting an appropriate
three-dimensional perspective (that is, a portion of the n-dimensional space
containing the three vectors of interest), a view is obtained that preserves both lengths
and angles. Thus, it is possible, with the right choice of axes, to illustrate certain
algebraic statistical concepts in terms of only two or three vectors of any dimension n.
Since the specific choice of axes is not relevant to the geometry, we shall always
label the coordinate axes 1, 2, and 3.

It is possible to give a geometrical interpretation of the process of finding a sample
mean. We start by defining the $n \times 1$ vector $\mathbf{1}_n' = [1, 1, \ldots, 1]$. (To simplify the
notation, the subscript n will be dropped when the dimension of the vector $\mathbf{1}_n$ is
clear from the context.) The vector $\mathbf{1}$ forms equal angles with each of the n
coordinate axes, so the vector $(1/\sqrt{n})\mathbf{1}$ has unit length in the equal-angle direction.
Consider the vector $\mathbf{y}_i' = [x_{1i}, x_{2i}, \ldots, x_{ni}]$. The projection of $\mathbf{y}_i$ on the unit vector
$(1/\sqrt{n})\mathbf{1}$ is, by (2-8),

\[
\mathbf{y}_i'\left(\frac{1}{\sqrt{n}}\mathbf{1}\right)\frac{1}{\sqrt{n}}\mathbf{1}
= \frac{x_{1i} + x_{2i} + \cdots + x_{ni}}{n}\,\mathbf{1} = \bar{x}_i\mathbf{1}
\tag{3-3}
\]

That is, the sample mean $\bar{x}_i = (x_{1i} + x_{2i} + \cdots + x_{ni})/n = \mathbf{y}_i'\mathbf{1}/n$ corresponds to the
multiple of $\mathbf{1}$ required to give the projection of $\mathbf{y}_i$ onto the line determined by $\mathbf{1}$.



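Further, for each $\mathbf{y}_i$, we have the decomposition

\[
\mathbf{y}_i = \bar{x}_i\mathbf{1} + (\mathbf{y}_i - \bar{x}_i\mathbf{1})
\]

where $\bar{x}_i\mathbf{1}$ is perpendicular to $\mathbf{y}_i - \bar{x}_i\mathbf{1}$. The deviation, or mean corrected, vector is

\[
\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1} = \begin{bmatrix} x_{1i} - \bar{x}_i \\ x_{2i} - \bar{x}_i \\ \vdots \\ x_{ni} - \bar{x}_i \end{bmatrix}
\tag{3-4}
\]

The elements of $\mathbf{d}_i$ are the deviations of the measurements on the ith variable from
their sample mean. Decomposition of the $\mathbf{y}_i$ vectors into mean components and
deviation from the mean components is shown in Figure 3.3 for p = 3 and n = 3.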


Figure 3.3 The decomposition of $\mathbf{y}_i$ into a mean component $\bar{x}_i\mathbf{1}$ and a deviation
component $\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1}$, i = 1, 2, 3.


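Example 3.3 (Decomposing a vector into its mean and deviation components) Let
us carry out the decomposition of $\mathbf{y}_i$ into $\bar{x}_i\mathbf{1}$ and $\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1}$, i = 1, 2, for the data
given in Example 3.2:

\[
\mathbf{X} = \begin{bmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{bmatrix}
\]

Here, $\bar{x}_1 = (4 - 1 + 3)/3 = 2$ and $\bar{x}_2 = (1 + 3 + 5)/3 = 3$, so

\[
\bar{x}_1\mathbf{1} = 2\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix}
\qquad
\bar{x}_2\mathbf{1} = 3\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix}
\]

Consequently,

\[
\mathbf{d}_1 = \mathbf{y}_1 - \bar{x}_1\mathbf{1} = \begin{bmatrix} 4 \\ -1 \\ 3 \end{bmatrix} - \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix} = \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix}
\]

and

\[
\mathbf{d}_2 = \mathbf{y}_2 - \bar{x}_2\mathbf{1} = \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} - \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix} = \begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix}
\]

We note that $\bar{x}_1\mathbf{1}$ and $\mathbf{d}_1 = \mathbf{y}_1 - \bar{x}_1\mathbf{1}$ are perpendicular, because

\[
(\bar{x}_1\mathbf{1})'(\mathbf{y}_1 - \bar{x}_1\mathbf{1}) = \begin{bmatrix} 2 & 2 & 2 \end{bmatrix}\begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} = 4 - 6 + 2 = 0
\]

A similar result holds for $\bar{x}_2\mathbf{1}$ and $\mathbf{d}_2 = \mathbf{y}_2 - \bar{x}_2\mathbf{1}$. The decomposition is

\[
\mathbf{y}_1 = \begin{bmatrix} 4 \\ -1 \\ 3 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix} + \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix}
\qquad
\mathbf{y}_2 = \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix} + \begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix}
\]
■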
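The decomposition in Example 3.3 can be reproduced in a few lines. This is a minimal sketch, assuming NumPy; the variable names (`mean_components`, `deviations`, `inner`) are illustrative choices and not notation from the text.

```python
import numpy as np

# Data matrix from Example 3.2: columns are the vectors y_1 and y_2.
X = np.array([[4.0, 1.0],
              [-1.0, 3.0],
              [3.0, 5.0]])
n = X.shape[0]
ones = np.ones(n)

# Projecting each column y_i on 1 gives (y_i'1 / 1'1) 1 = xbar_i 1.
means = X.T @ ones / n
mean_components = np.outer(ones, means)   # columns are xbar_i * 1
deviations = X - mean_components          # columns are d_i = y_i - xbar_i * 1

# Inner products (xbar_i 1)'d_i; each should be zero (perpendicular components).
inner = np.sum(mean_components * deviations, axis=0)

print(means)
print(deviations)
print(inner)
```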


For the time being, we are interested in the deviation (or residual) vectors
$\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1}$. A plot of the deviation vectors of Figure 3.3 is given in Figure 3.4.

Figure 3.4 The deviation vectors $\mathbf{d}_i$ from Figure 3.3.

We have translated the deviation vectors to the origin without changing their lengths
or orientations.

Now consider the squared lengths of the deviation vectors. Using (2-5) and
(3-4), we obtain

\[
L_{\mathbf{d}_i}^2 = \mathbf{d}_i'\mathbf{d}_i = \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2
\tag{3-5}
\]

(Length of deviation vector)$^2$ = sum of squared deviations

From (1-3), we see that the squared length is proportional to the variance of
the measurements on the ith variable. Equivalently, the length is proportional to
the standard deviation. Longer vectors represent more variability than shorter
vectors.

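For any two deviation vectors $\mathbf{d}_i$ and $\mathbf{d}_k$,

\[
\mathbf{d}_i'\mathbf{d}_k = \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)
\tag{3-6}
\]

Let $\theta_{ik}$ denote the angle formed by the vectors $\mathbf{d}_i$ and $\mathbf{d}_k$. From (2-6), we get

\[
\mathbf{d}_i'\mathbf{d}_k = L_{\mathbf{d}_i}L_{\mathbf{d}_k}\cos(\theta_{ik})
\]

or, using (3-5) and (3-6), we obtain

\[
\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)
= \sqrt{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2}\,\sqrt{\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}\,\cos(\theta_{ik})
\]

so that [see (1-5)]

\[
r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} = \cos(\theta_{ik})
\tag{3-7}
\]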


The cosine of the angle is the sample correlation coefficient. Thus, if the two 
deviation vectors have nearly the same orientation, the sample correlation will be 
close to 1. If the two vectors are nearly perpendicular, the sample correlation will 
be approximately zero. If the two vectors are oriented in nearly opposite directions, 
the sample correlation will be close to —1. 


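Example 3.4 (Calculating $\mathbf{S}_n$ and $\mathbf{R}$ from deviation vectors) Given the deviation
vectors in Example 3.3, let us compute the sample variance-covariance matrix $\mathbf{S}_n$ and
sample correlation matrix $\mathbf{R}$ using the geometrical concepts just introduced.

From Example 3.3,

\[
\mathbf{d}_1 = \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix}
\quad\text{and}\quad
\mathbf{d}_2 = \begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix}
\]

Figure 3.5 The deviation vectors $\mathbf{d}_1$ and $\mathbf{d}_2$.

These vectors, translated to the origin, are shown in Figure 3.5. Now,

\[
\mathbf{d}_1'\mathbf{d}_1 = \begin{bmatrix} 2 & -3 & 1 \end{bmatrix}\begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} = 14 = 3s_{11}
\]

or $s_{11} = \tfrac{14}{3}$. Also,

\[
\mathbf{d}_2'\mathbf{d}_2 = \begin{bmatrix} -2 & 0 & 2 \end{bmatrix}\begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix} = 8 = 3s_{22}
\]

or $s_{22} = \tfrac{8}{3}$. Finally,

\[
\mathbf{d}_1'\mathbf{d}_2 = \begin{bmatrix} 2 & -3 & 1 \end{bmatrix}\begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix} = -2 = 3s_{12}
\]

or $s_{12} = -\tfrac{2}{3}$. Consequently,

\[
r_{12} = \frac{s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} = \frac{-\tfrac{2}{3}}{\sqrt{\tfrac{14}{3}}\sqrt{\tfrac{8}{3}}} = -.189
\]

and

\[
\mathbf{S}_n = \begin{bmatrix} \tfrac{14}{3} & -\tfrac{2}{3} \\ -\tfrac{2}{3} & \tfrac{8}{3} \end{bmatrix},
\qquad
\mathbf{R} = \begin{bmatrix} 1 & -.189 \\ -.189 & 1 \end{bmatrix}
\]
■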
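The same quantities follow directly from the inner products of the deviation vectors. The sketch below assumes NumPy and uses illustrative variable names; it treats $\mathbf{S}_n$ with divisor n, as in this section, and obtains each correlation as the cosine of the angle between two deviation vectors.

```python
import numpy as np

# Deviation vectors d_1 and d_2 from Example 3.3, stored as columns.
D = np.array([[2.0, -2.0],
              [-3.0, 0.0],
              [1.0, 2.0]])
n = D.shape[0]

# Matrix of inner products d_i'd_k.
G = D.T @ D

# S_n uses divisor n, so n * s_ik = d_i'd_k.
S_n = G / n

# r_ik = d_i'd_k / (L_di * L_dk) = cos(theta_ik).
lengths = np.sqrt(np.diag(G))
R = G / np.outer(lengths, lengths)

print(S_n)
print(R)
```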




The concepts of length, angle, and projection have provided us with a geometrical 
interpretation of the sample. We summarize as follows: 


Geometrical Interpretation of the Sample

1. The projection of a column $\mathbf{y}_i$ of the data matrix $\mathbf{X}$ onto the equal angular
   vector $\mathbf{1}$ is the vector $\bar{x}_i\mathbf{1}$. The vector $\bar{x}_i\mathbf{1}$ has length $\sqrt{n}\,|\bar{x}_i|$. Therefore, the
   ith sample mean, $\bar{x}_i$, is related to the length of the projection of $\mathbf{y}_i$ on $\mathbf{1}$.

2. The information comprising $\mathbf{S}_n$ is obtained from the deviation vectors
   $\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1} = [x_{1i} - \bar{x}_i, x_{2i} - \bar{x}_i, \ldots, x_{ni} - \bar{x}_i]'$. The square of the length of $\mathbf{d}_i$
   is $ns_{ii}$, and the (inner) product between $\mathbf{d}_i$ and $\mathbf{d}_k$ is $ns_{ik}$.¹

3. The sample correlation $r_{ik}$ is the cosine of the angle between $\mathbf{d}_i$ and $\mathbf{d}_k$.


3.3 Random Samples and the Expected Values of 
the Sample Mean and Covariance Matrix 


In order to study the sampling variability of statistics such as $\bar{\mathbf{x}}$ and $\mathbf{S}_n$ with the
ultimate aim of making inferences, we need to make assumptions about the variables
whose observed values constitute the data set $\mathbf{X}$.

Suppose, then, that the data have not yet been observed, but we intend to collect
n sets of measurements on p variables. Before the measurements are made, their
values cannot, in general, be predicted exactly. Consequently, we treat them as
random variables. In this context, let the (j, k)-th entry in the data matrix be the
random variable $X_{jk}$. Each set of measurements $\mathbf{X}_j$ on p variables is a random
vector, and we have the random matrix

\[
\underset{(n\times p)}{\mathbf{X}} = \begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1p} \\
X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{np}
\end{bmatrix}
= \begin{bmatrix} \mathbf{X}_1' \\ \mathbf{X}_2' \\ \vdots \\ \mathbf{X}_n' \end{bmatrix}
\tag{3-8}
\]

A random sample can now be defined.

If the row vectors $\mathbf{X}_1', \mathbf{X}_2', \ldots, \mathbf{X}_n'$ in (3-8) represent independent observations
from a common joint distribution with density function $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_p)$,
then $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ are said to form a random sample from $f(\mathbf{x})$. Mathematically,
$\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ form a random sample if their joint density function is given by the
product $f(\mathbf{x}_1)f(\mathbf{x}_2)\cdots f(\mathbf{x}_n)$, where $f(\mathbf{x}_j) = f(x_{j1}, x_{j2}, \ldots, x_{jp})$ is the density
function for the jth row vector.

Two points connected with the definition of random sample merit special attention:

1. The measurements of the p variables in a single trial, such as
   $\mathbf{X}_j' = [X_{j1}, X_{j2}, \ldots, X_{jp}]$, will usually be correlated. Indeed, we expect this to be the
   case. The measurements from different trials must, however, be independent.

2. The independence of measurements from trial to trial may not hold when the
   variables are likely to drift over time, as with sets of p stock prices or p economic
   indicators. Violations of the tentative assumption of independence can
   have a serious impact on the quality of statistical inferences.

¹ The square of the length and the inner product are $(n - 1)s_{ii}$ and $(n - 1)s_{ik}$, respectively, when
the divisor n − 1 is used in the definitions of the sample variance and covariance.


The following examples illustrate these remarks. 


Example 3.5 (Selecting a random sample) As a preliminary step in designing a 
permit system for utilizing a wilderness canoe area without overcrowding, a natural- 
resource manager took a survey of users. The total wilderness area was divided into 
subregions, and respondents were asked to give information on the regions visited, 
lengths of stay, and other variables. 

The method followed was to select persons randomly (perhaps using a random 
number table) from all those who entered the wilderness area during a particular 
week. All persons were equally likely to be in the sample, so the more popular 
entrances were represented by larger proportions of canoeists. 

Here one would expect the sample observations to conform closely to the criterion
for a random sample from the population of users or potential users. On the
other hand, if one of the samplers had waited at a campsite far in the interior of the
area and interviewed only canoeists who reached that spot, successive measurements
would not be independent. For instance, lengths of stay in the wilderness area for
different canoeists from this group would all tend to be large. ■


Example 3.6 (A nonrandom sample) Because of concerns with future solid-waste
disposal, an ongoing study concerns the gross weight of municipal solid waste
generated per year in the United States (Environmental Protection Agency). Estimated
amounts attributed to $x_1$ = paper and paperboard waste and $x_2$ = plastic waste, in
millions of tons, are given for selected years in Table 3.1. Should these measurements
on $\mathbf{X}' = [X_1, X_2]$ be treated as a random sample of size n = 7? No! In fact,
except for a slight but fortunate downturn in paper and paperboard waste in 2003,
both variables are increasing over time.


Table 3.1 Solid Waste

Year             1960   1970   1980   1990   1995   2000   2003
x1 (paper)       29.2   44.3   55.2   72.7   81.7   87.7   83.1
x2 (plastics)      .4    2.9    6.8   17.1   18.9   24.7   26.7


As we have argued heuristically in Chapter 1, the notion of statistical independence
has important implications for measuring distance. Euclidean distance appears
appropriate if the components of a vector are independent and have the same
variances. Suppose we consider the location of the kth column $\mathbf{Y}_k' = [X_{1k}, X_{2k}, \ldots, X_{nk}]$
of $\mathbf{X}$, regarded as a point in n dimensions. The location of this point is determined by
the joint probability distribution $f(\mathbf{y}_k) = f(x_{1k}, x_{2k}, \ldots, x_{nk})$. When the measurements
$X_{1k}, X_{2k}, \ldots, X_{nk}$ are a random sample, $f(\mathbf{y}_k) = f(x_{1k}, x_{2k}, \ldots, x_{nk}) =
f_k(x_{1k})f_k(x_{2k})\cdots f_k(x_{nk})$ and, consequently, each coordinate $x_{jk}$ contributes equally
to the location through the identical marginal distributions $f_k(x_{jk})$.




If the n components are not independent or the marginal distributions are not 
identical, the influence of individual measurements (coordinates) on location is 
asymmetrical. We would then be led to consider a distance function in which the 
coordinates were weighted unequally, as in the “statistical” distances or quadratic 
forms introduced in Chapters 1 and 2. 

Certain conclusions can be reached concerning the sampling distributions of $\bar{\mathbf{X}}$
and $\mathbf{S}_n$ without making further assumptions regarding the form of the underlying
joint distribution of the variables. In particular, we can see how $\bar{\mathbf{X}}$ and $\mathbf{S}_n$ fare as point
estimators of the corresponding population mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$.


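Result 3.1. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from a joint distribution that
has mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Then $\bar{\mathbf{X}}$ is an unbiased estimator of $\boldsymbol{\mu}$,
and its covariance matrix is

\[
\frac{1}{n}\boldsymbol{\Sigma}
\]

That is,

\[
\begin{aligned}
E(\bar{\mathbf{X}}) &= \boldsymbol{\mu} && \text{(population mean vector)} \\
\operatorname{Cov}(\bar{\mathbf{X}}) &= \frac{1}{n}\boldsymbol{\Sigma} && \text{(population variance-covariance matrix divided by sample size)}
\end{aligned}
\tag{3-9}
\]

For the covariance matrix $\mathbf{S}_n$,

\[
E(\mathbf{S}_n) = \frac{n - 1}{n}\boldsymbol{\Sigma} = \boldsymbol{\Sigma} - \frac{1}{n}\boldsymbol{\Sigma}
\]

Thus,

\[
E\left(\frac{n}{n - 1}\mathbf{S}_n\right) = \boldsymbol{\Sigma}
\tag{3-10}
\]

so $[n/(n - 1)]\mathbf{S}_n$ is an unbiased estimator of $\boldsymbol{\Sigma}$, while $\mathbf{S}_n$ is a biased estimator with
(bias) $= E(\mathbf{S}_n) - \boldsymbol{\Sigma} = -(1/n)\boldsymbol{\Sigma}$.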


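Proof. Now, $\bar{\mathbf{X}} = (\mathbf{X}_1 + \mathbf{X}_2 + \cdots + \mathbf{X}_n)/n$. The repeated use of the properties of
expectation in (2-24) for two vectors gives

\[
\begin{aligned}
E(\bar{\mathbf{X}}) &= E\left(\frac{1}{n}\mathbf{X}_1 + \frac{1}{n}\mathbf{X}_2 + \cdots + \frac{1}{n}\mathbf{X}_n\right) \\
&= E\left(\frac{1}{n}\mathbf{X}_1\right) + E\left(\frac{1}{n}\mathbf{X}_2\right) + \cdots + E\left(\frac{1}{n}\mathbf{X}_n\right) \\
&= \frac{1}{n}E(\mathbf{X}_1) + \frac{1}{n}E(\mathbf{X}_2) + \cdots + \frac{1}{n}E(\mathbf{X}_n)
 = \frac{1}{n}\boldsymbol{\mu} + \frac{1}{n}\boldsymbol{\mu} + \cdots + \frac{1}{n}\boldsymbol{\mu} = \boldsymbol{\mu}
\end{aligned}
\]

Next,

\[
(\bar{\mathbf{X}} - \boldsymbol{\mu})(\bar{\mathbf{X}} - \boldsymbol{\mu})'
= \left(\frac{1}{n}\sum_{j=1}^{n} (\mathbf{X}_j - \boldsymbol{\mu})\right)\left(\frac{1}{n}\sum_{\ell=1}^{n} (\mathbf{X}_\ell - \boldsymbol{\mu})\right)'
= \frac{1}{n^2}\sum_{j=1}^{n}\sum_{\ell=1}^{n} (\mathbf{X}_j - \boldsymbol{\mu})(\mathbf{X}_\ell - \boldsymbol{\mu})'
\]

so

\[
\operatorname{Cov}(\bar{\mathbf{X}}) = E(\bar{\mathbf{X}} - \boldsymbol{\mu})(\bar{\mathbf{X}} - \boldsymbol{\mu})'
= \frac{1}{n^2}\left(\sum_{j=1}^{n}\sum_{\ell=1}^{n} E(\mathbf{X}_j - \boldsymbol{\mu})(\mathbf{X}_\ell - \boldsymbol{\mu})'\right)
\]

For $j \ne \ell$, each entry in $E(\mathbf{X}_j - \boldsymbol{\mu})(\mathbf{X}_\ell - \boldsymbol{\mu})'$ is zero because the entry is the
covariance between a component of $\mathbf{X}_j$ and a component of $\mathbf{X}_\ell$, and these are
independent. [See Exercise 3.17 and (2-29).]

Therefore,

\[
\operatorname{Cov}(\bar{\mathbf{X}}) = \frac{1}{n^2}\left(\sum_{j=1}^{n} E(\mathbf{X}_j - \boldsymbol{\mu})(\mathbf{X}_j - \boldsymbol{\mu})'\right)
\]

Since $\boldsymbol{\Sigma} = E(\mathbf{X}_j - \boldsymbol{\mu})(\mathbf{X}_j - \boldsymbol{\mu})'$ is the common population covariance matrix for
each $\mathbf{X}_j$, we have

\[
\operatorname{Cov}(\bar{\mathbf{X}}) = \frac{1}{n^2}\left(\sum_{j=1}^{n} E(\mathbf{X}_j - \boldsymbol{\mu})(\mathbf{X}_j - \boldsymbol{\mu})'\right)
= \frac{1}{n^2}\underbrace{(\boldsymbol{\Sigma} + \boldsymbol{\Sigma} + \cdots + \boldsymbol{\Sigma})}_{n\ \text{terms}}
= \frac{1}{n^2}(n\boldsymbol{\Sigma}) = \left(\frac{1}{n}\right)\boldsymbol{\Sigma}
\]

To obtain the expected value of $\mathbf{S}_n$, we first note that $(X_{ji} - \bar{X}_i)(X_{jk} - \bar{X}_k)$ is
the (i, k)th element of $(\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})'$. The matrix representing sums of
squares and cross products can then be written as

\[
\sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})'
= \sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}})\mathbf{X}_j' + \left(\sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}})\right)(-\bar{\mathbf{X}})'
= \sum_{j=1}^{n} \mathbf{X}_j\mathbf{X}_j' - n\bar{\mathbf{X}}\bar{\mathbf{X}}'
\]

since $\sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}}) = \mathbf{0}$ and $n\bar{\mathbf{X}}' = \sum_{j=1}^{n} \mathbf{X}_j'$. Therefore, its expected value is

\[
E\left(\sum_{j=1}^{n} \mathbf{X}_j\mathbf{X}_j' - n\bar{\mathbf{X}}\bar{\mathbf{X}}'\right) = \sum_{j=1}^{n} E(\mathbf{X}_j\mathbf{X}_j') - nE(\bar{\mathbf{X}}\bar{\mathbf{X}}')
\]

For any random vector $\mathbf{V}$ with $E(\mathbf{V}) = \boldsymbol{\mu}_V$ and $\operatorname{Cov}(\mathbf{V}) = \boldsymbol{\Sigma}_V$, we have $E(\mathbf{V}\mathbf{V}') =
\boldsymbol{\Sigma}_V + \boldsymbol{\mu}_V\boldsymbol{\mu}_V'$. (See Exercise 3.16.) Consequently,

\[
E(\mathbf{X}_j\mathbf{X}_j') = \boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}'
\quad\text{and}\quad
E(\bar{\mathbf{X}}\bar{\mathbf{X}}') = \frac{1}{n}\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}'
\]

Using these results, we obtain

\[
\sum_{j=1}^{n} E(\mathbf{X}_j\mathbf{X}_j') - nE(\bar{\mathbf{X}}\bar{\mathbf{X}}')
= n\boldsymbol{\Sigma} + n\boldsymbol{\mu}\boldsymbol{\mu}' - n\left(\frac{1}{n}\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}'\right) = (n - 1)\boldsymbol{\Sigma}
\]

and thus, since $\mathbf{S}_n = (1/n)\left(\sum_{j=1}^{n} \mathbf{X}_j\mathbf{X}_j' - n\bar{\mathbf{X}}\bar{\mathbf{X}}'\right)$, it follows immediately that

\[
E(\mathbf{S}_n) = \frac{n - 1}{n}\boldsymbol{\Sigma}
\]
■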
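Result 3.1 can be checked exactly on a small case by enumerating every possible sample. The sketch below is illustrative only: it assumes NumPy, and it invents a hypothetical population of three equally likely bivariate points (borrowed from the data of Example 3.2) from which independent draws of size n = 2 are taken with replacement. The names `population`, `Sigma`, and `S_n` are choices made for this example.

```python
import itertools
import numpy as np

# Hypothetical population: three equally likely points in p = 2 dimensions.
population = np.array([[4.0, 1.0],
                       [-1.0, 3.0],
                       [3.0, 5.0]])
mu = population.mean(axis=0)
centered = population - mu
Sigma = centered.T @ centered / len(population)   # population covariance matrix

# All ordered samples of size n drawn independently (with replacement);
# each of the 3**n samples is equally likely.
n = 2
samples = [population[list(idx)]
           for idx in itertools.product(range(len(population)), repeat=n)]

# Exact expectation and covariance of the sample mean vector.
xbars = np.array([s.mean(axis=0) for s in samples])
E_xbar = xbars.mean(axis=0)
Cov_xbar = (xbars - mu).T @ (xbars - mu) / len(samples)

def S_n(sample):
    """Sample covariance matrix with divisor n."""
    d = sample - sample.mean(axis=0)
    return d.T @ d / len(sample)

# Exact expectation of S_n over all equally likely samples.
E_S_n = np.mean([S_n(s) for s in samples], axis=0)

print(E_xbar, mu)
print(Cov_xbar, Sigma / n)
print(E_S_n, (n - 1) / n * Sigma)
```

Each printed pair agrees, matching $E(\bar{\mathbf{X}}) = \boldsymbol{\mu}$, $\operatorname{Cov}(\bar{\mathbf{X}}) = (1/n)\boldsymbol{\Sigma}$, and $E(\mathbf{S}_n) = [(n - 1)/n]\boldsymbol{\Sigma}$ for this particular population.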




Result 3.1 shows that the (i, k)th entry, $(n - 1)^{-1}\sum_{j=1}^{n} (X_{ji} - \bar{X}_i)(X_{jk} - \bar{X}_k)$, of
$[n/(n - 1)]\mathbf{S}_n$ is an unbiased estimator of $\sigma_{ik}$. However, the individual sample standard
deviations $\sqrt{s_{ii}}$, calculated with either n or n − 1 as a divisor, are not unbiased
estimators of the corresponding population quantities $\sqrt{\sigma_{ii}}$. Moreover, the correlation
coefficients $r_{ik}$ are not unbiased estimators of the population quantities $\rho_{ik}$.
However, the bias $E(\sqrt{s_{ii}}) - \sqrt{\sigma_{ii}}$, or $E(r_{ik}) - \rho_{ik}$, can usually be ignored if the
sample size n is moderately large.

Consideration of bias motivates a slightly modified definition of the sample variance-covariance matrix. Result 3.1 provides us with an unbiased estimator $\mathbf{S}$ of $\boldsymbol{\Sigma}$:

(Unbiased) Sample Variance-Covariance Matrix

$$\mathbf{S} = \left(\frac{n}{n-1}\right)\mathbf{S}_n = \frac{1}{n-1}\sum_{j=1}^{n}(\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})' \tag{3-11}$$

Here $\mathbf{S}$, without a subscript, has $(i, k)$th entry $(n - 1)^{-1}\sum_{j=1}^{n}(X_{ji} - \bar{X}_i)(X_{jk} - \bar{X}_k)$.

This definition of sample covariance is commonly used in many multivariate test statistics. Therefore, it will replace $\mathbf{S}_n$ as the sample covariance matrix in most of the material throughout the rest of this book.
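The relation between the two divisors can be checked numerically. The following is a minimal sketch, not a reference implementation: the data set and the helper names (`mean_vector`, `cross_products`, `cov_n`, `cov_unbiased`) are hypothetical, and exact rational arithmetic is used so that the identity $\mathbf{S} = [n/(n-1)]\mathbf{S}_n$ holds without rounding.

```python
from fractions import Fraction

def mean_vector(rows):
    """Column means of a data matrix stored as a list of rows."""
    n = len(rows)
    return [Fraction(sum(r[k] for r in rows), n) for k in range(len(rows[0]))]

def cross_products(rows):
    """Sum over j of (x_j - xbar)(x_j - xbar)', returned as a p x p list of lists."""
    xbar = mean_vector(rows)
    p = len(xbar)
    return [[sum((r[i] - xbar[i]) * (r[k] - xbar[k]) for r in rows)
             for k in range(p)] for i in range(p)]

def cov_n(rows):
    """S_n: sums of squares and cross products divided by n."""
    n = len(rows)
    return [[v / n for v in row] for row in cross_products(rows)]

def cov_unbiased(rows):
    """S: the same sums divided by n - 1, as in (3-11)."""
    n = len(rows)
    return [[v / (n - 1) for v in row] for row in cross_products(rows)]

# Hypothetical data: n = 4 units measured on p = 2 variables.
data = [[1, 2], [3, 1], [4, 6], [8, 3]]
n = len(data)
S_n = cov_n(data)
S = cov_unbiased(data)

for i in range(2):
    for k in range(2):
        # Entry by entry, S equals [n/(n - 1)] S_n.
        print(i, k, S[i][k], Fraction(n, n - 1) * S_n[i][k])
```

Only the divisor differs between the two functions, so for large n the two matrices are nearly identical.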


3.4 Generalized Variance 


With a single variable, the sample variance is often used to describe the amount of variation in the measurements on that variable. When $p$ variables are observed on each unit, the variation is described by the sample variance-covariance matrix

$$\mathbf{S} = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{bmatrix} = \left\{ s_{ik} = \frac{1}{n-1}\sum_{j=1}^{n}(x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k) \right\}$$

The sample covariance matrix contains $p$ variances and $\tfrac{1}{2}p(p - 1)$ potentially different covariances. Sometimes it is desirable to assign a single numerical value for the variation expressed by $\mathbf{S}$. One choice for a value is the determinant of $\mathbf{S}$, which reduces to the usual sample variance of a single characteristic when $p = 1$. This determinant² is called the generalized sample variance:

$$\text{Generalized sample variance} = |\mathbf{S}| \tag{3-12}$$

² Definition 2A.24 defines "determinant" and indicates one method for calculating the value of a determinant.




Example 3.7 (Calculating a generalized variance) Employees ($x_1$) and profits per employee ($x_2$) for the 16 largest publishing firms in the United States are shown in Figure 1.3. The sample covariance matrix, obtained from the data in the April 30, 1990, Forbes magazine article, is

$$\mathbf{S} = \begin{bmatrix} 252.04 & -68.43 \\ -68.43 & 123.67 \end{bmatrix}$$

Evaluate the generalized variance.
In this case, we compute

$$|\mathbf{S}| = (252.04)(123.67) - (-68.43)(-68.43) = 26{,}487 \qquad\blacksquare$$
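The arithmetic in Example 3.7 can be reproduced in a few lines. This is only a sketch; the helper name `det2` is an illustrative choice, not something defined in the text.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix given as a list of two rows."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Sample covariance matrix quoted in Example 3.7.
S = [[252.04, -68.43],
     [-68.43, 123.67]]

gen_var = det2(S)
print(round(gen_var))  # rounded to the nearest integer, as in the example
```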


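The generalized sample variance provides one way of writing the information on all variances and covariances as a single number. Of course, when $p > 1$, some information about the sample is lost in the process. A geometrical interpretation of $|\mathbf{S}|$ will help us appreciate its strengths and weaknesses as a descriptive summary.

Consider the area generated within the plane by two deviation vectors $\mathbf{d}_1 = \mathbf{y}_1 - \bar{x}_1\mathbf{1}$ and $\mathbf{d}_2 = \mathbf{y}_2 - \bar{x}_2\mathbf{1}$. Let $L_{d_1}$ be the length of $\mathbf{d}_1$ and $L_{d_2}$ the length of $\mathbf{d}_2$. By elementary geometry, we have the diagram

[Diagram not reproduced: the deviation vectors $\mathbf{d}_1$ and $\mathbf{d}_2$ separated by the angle $\theta$, with height $L_{d_1}\sin(\theta)$.]

and the area of the trapezoid is $|L_{d_1}\sin(\theta)|\,L_{d_2}$. Since $\cos^2(\theta) + \sin^2(\theta) = 1$, we can express this area as

$$\text{Area} = L_{d_1}L_{d_2}\sqrt{1 - \cos^2(\theta)}$$

From (3-5) and (3-7),

$$L_{d_1} = \sqrt{\sum_{j=1}^{n}(x_{j1} - \bar{x}_1)^2} = \sqrt{(n-1)s_{11}}$$

$$L_{d_2} = \sqrt{\sum_{j=1}^{n}(x_{j2} - \bar{x}_2)^2} = \sqrt{(n-1)s_{22}}$$

and

$$\cos(\theta) = r_{12}$$

Therefore,

$$\text{Area} = (n-1)\sqrt{s_{11}}\sqrt{s_{22}}\sqrt{1 - r_{12}^2} = (n-1)\sqrt{s_{11}s_{22}(1 - r_{12}^2)} \tag{3-13}$$

Also,

$$|\mathbf{S}| = \left|\begin{bmatrix} s_{11} & s_{12} \\ s_{12} & s_{22} \end{bmatrix}\right| = \left|\begin{bmatrix} s_{11} & \sqrt{s_{11}}\sqrt{s_{22}}\,r_{12} \\ \sqrt{s_{11}}\sqrt{s_{22}}\,r_{12} & s_{22} \end{bmatrix}\right| = s_{11}s_{22} - s_{11}s_{22}r_{12}^2 = s_{11}s_{22}(1 - r_{12}^2) \tag{3-14}$$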




Figure 3.6 (a) "Large" generalized sample variance for p = 3. (b) "Small" generalized sample variance for p = 3.


If we compare (3-14) with (3-13), we see that

$$|\mathbf{S}| = (\text{area})^2/(n - 1)^2$$

Assuming now that $|\mathbf{S}| = (n - 1)^{-(p-1)}(\text{volume})^2$ holds for the volume generated in $n$ space by the $p - 1$ deviation vectors $\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_{p-1}$, we can establish the following general result for $p$ deviation vectors by induction (see [1], p. 266):

$$\text{Generalized sample variance} = |\mathbf{S}| = (n - 1)^{-p}(\text{volume})^2 \tag{3-15}$$³
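For $p = 2$ the relation $|\mathbf{S}| = (\text{area})^2/(n-1)^2$ can be verified directly from two deviation vectors. The sketch below uses a small hypothetical data set and hypothetical helper names; it computes the area from the lengths and the angle, and separately computes $|\mathbf{S}|$ from the variances and covariance.

```python
import math

def deviation(y):
    """Deviation vector d = y - ybar * 1."""
    m = sum(y) / len(y)
    return [v - m for v in y]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical measurements on two variables for n = 3 units.
y1 = [1.0, 4.0, 4.0]
y2 = [2.0, 1.0, 0.0]
n = len(y1)

d1, d2 = deviation(y1), deviation(y2)
L1, L2 = math.sqrt(dot(d1, d1)), math.sqrt(dot(d2, d2))
cos_theta = dot(d1, d2) / (L1 * L2)           # equals r_12
area = L1 * L2 * math.sqrt(1 - cos_theta ** 2)

s11 = dot(d1, d1) / (n - 1)
s22 = dot(d2, d2) / (n - 1)
s12 = dot(d1, d2) / (n - 1)
det_S = s11 * s22 - s12 ** 2

print(det_S, area ** 2 / (n - 1) ** 2)        # the two values should agree
```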


Equation (3-15) says that the generalized sample variance, for a fixed set of data, is proportional to the square of the volume generated by the $p$ deviation vectors $\mathbf{d}_1 = \mathbf{y}_1 - \bar{x}_1\mathbf{1}$, $\mathbf{d}_2 = \mathbf{y}_2 - \bar{x}_2\mathbf{1}, \ldots, \mathbf{d}_p = \mathbf{y}_p - \bar{x}_p\mathbf{1}$. Figures 3.6(a) and (b) show trapezoidal regions, generated by $p = 3$ residual vectors, corresponding to "large" and "small" generalized variances.

For a fixed sample size, it is clear from the geometry that volume, or $|\mathbf{S}|$, will increase when the length of any $\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1}$ (or $\sqrt{s_{ii}}$) is increased. In addition, volume will increase if the residual vectors of fixed length are moved until they are at right angles to one another, as in Figure 3.6(a). On the other hand, the volume, or $|\mathbf{S}|$, will be small if just one of the $s_{ii}$ is small or one of the deviation vectors lies nearly in the (hyper) plane formed by the others, or both. In the second case, the trapezoid has very little height above the plane. This is the situation in Figure 3.6(b), where $\mathbf{d}_3$ lies nearly in the plane formed by $\mathbf{d}_1$ and $\mathbf{d}_2$.


³ If generalized variance is defined in terms of the sample covariance matrix $\mathbf{S}_n = [(n - 1)/n]\mathbf{S}$, then, using Result 2A.11, $|\mathbf{S}_n| = |[(n - 1)/n]\mathbf{I}_p\mathbf{S}| = |[(n - 1)/n]\mathbf{I}_p|\,|\mathbf{S}| = [(n - 1)/n]^p|\mathbf{S}|$. Consequently, using (3-15), we can also write the following: Generalized sample variance $= |\mathbf{S}_n| = n^{-p}(\text{volume})^2$.




Generalized variance also has interpretations in the $p$-space scatter plot representation of the data. The most intuitive interpretation concerns the spread of the scatter about the sample mean point $\bar{\mathbf{x}}' = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p]$. Consider the measure of distance given in the comment below (2-19), with $\bar{\mathbf{x}}$ playing the role of the fixed point $\boldsymbol{\mu}$ and $\mathbf{S}^{-1}$ playing the role of $\mathbf{A}$. With these choices, the coordinates $\mathbf{x}' = [x_1, x_2, \ldots, x_p]$ of the points a constant distance $c$ from $\bar{\mathbf{x}}$ satisfy

$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) = c^2 \tag{3-16}$$

[When $p = 1$, $(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) = (x_1 - \bar{x}_1)^2/s_{11}$ is the squared distance from $x_1$ to $\bar{x}_1$ in standard deviation units.]

Equation (3-16) defines a hyperellipsoid (an ellipse if $p = 2$) centered at $\bar{\mathbf{x}}$. It can be shown using integral calculus that the volume of this hyperellipsoid is related to $|\mathbf{S}|$. In particular,

$$\text{Volume of } \{\mathbf{x} : (\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le c^2\} = k_p|\mathbf{S}|^{1/2}c^p \tag{3-17}$$

or

$$(\text{Volume of ellipsoid})^2 = (\text{constant})(\text{generalized sample variance})$$

where the constant $k_p$ is rather formidable.⁴ A large volume corresponds to a large generalized variance.
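The volume formula (3-17) is easy to evaluate once $k_p$ is available. The sketch below assumes the constant quoted in the text's footnote, $k_p = 2\pi^{p/2}/(p\,\Gamma(p/2))$; the function names `k_p` and `ellipsoid_volume` are hypothetical. For $p = 2$ and $p = 3$ the constant reduces to the familiar $\pi$ and $4\pi/3$.

```python
import math

def k_p(p):
    """Constant in (3-17), using the gamma-function form quoted in the footnote."""
    return 2 * math.pi ** (p / 2) / (p * math.gamma(p / 2))

def ellipsoid_volume(det_S, c, p):
    """Volume of {x : (x - xbar)' S^{-1} (x - xbar) <= c^2} from (3-17)."""
    return k_p(p) * math.sqrt(det_S) * c ** p

print(k_p(2), math.pi)          # area constant for an ellipse
print(k_p(3), 4 * math.pi / 3)  # volume constant for an ellipsoid

# Illustrative values only: |S| = 9 and c^2 = 5.99 in p = 2 dimensions.
print(ellipsoid_volume(9.0, math.sqrt(5.99), 2))
```

Because the volume depends on $\mathbf{S}$ only through $|\mathbf{S}|$, any two covariance matrices with the same determinant give the same volume for a given $c$.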

Although the generalized variance has some intuitively pleasing geometrical 
interpretations, it suffers from a basic weakness as a descriptive summary of the 
sample covariance matrix S, as the following example shows. 


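Example 3.8 (Interpreting the generalized variance) Figure 3.7 gives three scatter plots with very different patterns of correlation.
All three data sets have $\bar{\mathbf{x}}' = [2, 1]$, and the covariance matrices are

$$\mathbf{S} = \begin{bmatrix} 5 & 4 \\ 4 & 5 \end{bmatrix},\ r = .8 \qquad \mathbf{S} = \begin{bmatrix} 3 & 0 \\ 0 & 3 \end{bmatrix},\ r = 0 \qquad \mathbf{S} = \begin{bmatrix} 5 & -4 \\ -4 & 5 \end{bmatrix},\ r = -.8$$

Each covariance matrix $\mathbf{S}$ contains the information on the variability of the component variables and also the information required to calculate the correlation coefficient. In this sense, $\mathbf{S}$ captures the orientation and size of the pattern of scatter.

The eigenvalues and eigenvectors extracted from $\mathbf{S}$ further describe the pattern in the scatter plot. For

$$\mathbf{S} = \begin{bmatrix} 5 & 4 \\ 4 & 5 \end{bmatrix}, \quad\text{the eigenvalues satisfy}\quad 0 = (\lambda - 5)^2 - 4^2 = (\lambda - 9)(\lambda - 1)$$

and we determine the eigenvalue-eigenvector pairs $\lambda_1 = 9$, $\mathbf{e}_1' = [1/\sqrt{2},\ 1/\sqrt{2}]$ and $\lambda_2 = 1$, $\mathbf{e}_2' = [1/\sqrt{2},\ -1/\sqrt{2}]$.

Figure 3.7 Scatter plots with three different orientations.

⁴ For those who are curious, $k_p = 2\pi^{p/2}/(p\,\Gamma(p/2))$, where $\Gamma(z)$ denotes the gamma function evaluated at $z$.

The mean-centered ellipse, with center $\bar{\mathbf{x}}' = [2, 1]$ for all three cases, is

$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le c^2$$

To describe this ellipse, as in Section 2.3, with $\mathbf{A} = \mathbf{S}^{-1}$, we notice that if $(\lambda, \mathbf{e})$ is an eigenvalue-eigenvector pair for $\mathbf{S}$, then $(\lambda^{-1}, \mathbf{e})$ is an eigenvalue-eigenvector pair for $\mathbf{S}^{-1}$. That is, if $\mathbf{S}\mathbf{e} = \lambda\mathbf{e}$, then multiplying on the left by $\mathbf{S}^{-1}$ gives $\mathbf{S}^{-1}\mathbf{S}\mathbf{e} = \lambda\mathbf{S}^{-1}\mathbf{e}$, or $\mathbf{S}^{-1}\mathbf{e} = \lambda^{-1}\mathbf{e}$. Therefore, using the eigenvalues from $\mathbf{S}$, we know that the ellipse extends $c\sqrt{\lambda_i}$ in the direction of $\mathbf{e}_i$ from $\bar{\mathbf{x}}$.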




In $p = 2$ dimensions, the choice $c^2 = 5.99$ will produce an ellipse that contains approximately 95% of the observations. The vectors $3\sqrt{5.99}\,\mathbf{e}_1$ and $\sqrt{5.99}\,\mathbf{e}_2$ are drawn in Figure 3.8(a). Notice how the directions are the natural axes for the ellipse, and observe that the lengths of these scaled eigenvectors are comparable to the size of the pattern in each direction.

Next, for

$$\mathbf{S} = \begin{bmatrix} 3 & 0 \\ 0 & 3 \end{bmatrix}, \quad\text{the eigenvalues satisfy}\quad 0 = (\lambda - 3)^2$$

and we arbitrarily choose the eigenvectors so that $\lambda_1 = 3$, $\mathbf{e}_1' = [1, 0]$ and $\lambda_2 = 3$, $\mathbf{e}_2' = [0, 1]$. The vectors $\sqrt{3}\sqrt{5.99}\,\mathbf{e}_1$ and $\sqrt{3}\sqrt{5.99}\,\mathbf{e}_2$ are drawn in Figure 3.8(b).


Figure 3.8 Axes of the mean-centered 95% ellipses for the scatter plots in 
Figure 3.7. 




Finally, for

$$\mathbf{S} = \begin{bmatrix} 5 & -4 \\ -4 & 5 \end{bmatrix}, \quad\text{the eigenvalues satisfy}\quad 0 = (\lambda - 5)^2 - (-4)^2 = (\lambda - 9)(\lambda - 1)$$

and we determine the eigenvalue-eigenvector pairs $\lambda_1 = 9$, $\mathbf{e}_1' = [1/\sqrt{2},\ -1/\sqrt{2}]$ and $\lambda_2 = 1$, $\mathbf{e}_2' = [1/\sqrt{2},\ 1/\sqrt{2}]$. The scaled eigenvectors $3\sqrt{5.99}\,\mathbf{e}_1$ and $\sqrt{5.99}\,\mathbf{e}_2$ are drawn in Figure 3.8(c).

In two dimensions, we can often sketch the axes of the mean-centered ellipse by eye. However, the eigenvector approach also works for high dimensions where the data cannot be examined visually.

Note: Here the generalized variance $|\mathbf{S}|$ gives the same value, $|\mathbf{S}| = 9$, for all three patterns. But generalized variance does not contain any information on the orientation of the patterns. Generalized variance is easier to interpret when the two or more samples (patterns) being compared have nearly the same orientations.

Notice that our three patterns of scatter appear to cover approximately the same area. The ellipses that summarize the variability

$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le c^2$$

do have exactly the same area [see (3-17)], since all have $|\mathbf{S}| = 9$. ■


As Example 3.8 demonstrates, different correlation structures are not detected by $|\mathbf{S}|$. The situation for $p > 2$ can be even more obscure.

Consequently, it is often desirable to provide more than the single number $|\mathbf{S}|$ as a summary of $\mathbf{S}$. From Exercise 2.12, $|\mathbf{S}|$ can be expressed as the product $\lambda_1\lambda_2\cdots\lambda_p$ of the eigenvalues of $\mathbf{S}$. Moreover, the mean-centered ellipsoid based on $\mathbf{S}^{-1}$ [see (3-16)] has axes whose lengths are proportional to the square roots of the $\lambda_i$'s (see Section 2.3). These eigenvalues then provide information on the variability in all directions in the $p$-space representation of the data. It is useful, therefore, to report their individual values, as well as their product. We shall pursue this topic later when we discuss principal components.
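To make the point concrete, the eigenvalues of the three covariance matrices from Example 3.8 can be computed with the closed form for a symmetric 2 x 2 matrix. This is a sketch under that restriction (it is not a general eigenvalue routine); `eig_sym2` and the dictionary layout are hypothetical names chosen for illustration.

```python
import math

def eig_sym2(a, b, d):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, d]]."""
    center = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return center + radius, center - radius

# (s11, s12, s22) for the three patterns in Example 3.8.
patterns = {"r = .8": (5, 4, 5), "r = 0": (3, 0, 3), "r = -.8": (5, -4, 5)}
c = math.sqrt(5.99)  # c^2 = 5.99 gives the approximate 95% ellipse for p = 2

summary = {}
for label, (a, b, d) in patterns.items():
    lam1, lam2 = eig_sym2(a, b, d)
    summary[label] = {
        "eigenvalues": (lam1, lam2),
        "det": lam1 * lam2,                                   # |S| = product of eigenvalues
        "half_axes": (c * math.sqrt(lam1), c * math.sqrt(lam2)),
    }
    print(label, summary[label])
```

All three determinants equal 9, while the individual eigenvalues (and the eigenvectors, not computed here) distinguish the circular pattern from the two elongated ones.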


Situations in which the Generalized Sample Variance Is Zero 


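The generalized sample variance will be zero in certain situations. A generalized variance of zero is indicative of extreme degeneracy, in the sense that at least one column of the matrix of deviations,

$$\begin{bmatrix} \mathbf{x}_1' - \bar{\mathbf{x}}' \\ \mathbf{x}_2' - \bar{\mathbf{x}}' \\ \vdots \\ \mathbf{x}_n' - \bar{\mathbf{x}}' \end{bmatrix} = \begin{bmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1p} - \bar{x}_p \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2p} - \bar{x}_p \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \cdots & x_{np} - \bar{x}_p \end{bmatrix} = \underset{(n \times p)}{\mathbf{X}} - \underset{(n \times 1)}{\mathbf{1}}\ \underset{(1 \times p)}{\bar{\mathbf{x}}'} \tag{3-18}$$

can be expressed as a linear combination of the other columns. As we have shown geometrically, this is a case where one of the deviation vectors (for instance, $\mathbf{d}_i' = [x_{1i} - \bar{x}_i, \ldots, x_{ni} - \bar{x}_i]$) lies in the (hyper) plane generated by $\mathbf{d}_1, \ldots, \mathbf{d}_{i-1}, \mathbf{d}_{i+1}, \ldots, \mathbf{d}_p$.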




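Result 3.2. The generalized variance is zero when, and only when, at least one deviation vector lies in the (hyper) plane formed by all linear combinations of the others, that is, when the columns of the matrix of deviations in (3-18) are linearly dependent.

Proof. If the columns of the deviation matrix $(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')$ are linearly dependent, there is a linear combination of the columns such that

$$\mathbf{0} = a_1\operatorname{col}_1(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}') + \cdots + a_p\operatorname{col}_p(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}') = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} \quad\text{for some } \mathbf{a} \ne \mathbf{0}$$

But then, as you may verify, $(n - 1)\mathbf{S} = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')$ and

$$(n - 1)\mathbf{S}\mathbf{a} = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} = \mathbf{0}$$

so the same $\mathbf{a}$ corresponds to a linear dependency, $a_1\operatorname{col}_1(\mathbf{S}) + \cdots + a_p\operatorname{col}_p(\mathbf{S}) = \mathbf{S}\mathbf{a} = \mathbf{0}$, in the columns of $\mathbf{S}$. So, by Result 2A.9, $|\mathbf{S}| = 0$.

In the other direction, if $|\mathbf{S}| = 0$, then there is some linear combination $\mathbf{S}\mathbf{a}$ of the columns of $\mathbf{S}$ such that $\mathbf{S}\mathbf{a} = \mathbf{0}$. That is, $\mathbf{0} = (n - 1)\mathbf{S}\mathbf{a} = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a}$. Premultiplying by $\mathbf{a}'$ yields

$$0 = \mathbf{a}'(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} = L^2_{(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a}}$$

and, for the length to equal zero, we must have $(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} = \mathbf{0}$. Thus, the columns of $(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')$ are linearly dependent. ■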


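Example 3.9 (A case where the generalized variance is zero) Show that $|\mathbf{S}| = 0$ for

$$\underset{(3 \times 3)}{\mathbf{X}} = \begin{bmatrix} 1 & 2 & 5 \\ 4 & 1 & 6 \\ 4 & 0 & 4 \end{bmatrix}$$

and determine the degeneracy.
Here $\bar{\mathbf{x}}' = [3, 1, 5]$, so

$$\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}' = \begin{bmatrix} 1-3 & 2-1 & 5-5 \\ 4-3 & 1-1 & 6-5 \\ 4-3 & 0-1 & 4-5 \end{bmatrix} = \begin{bmatrix} -2 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & -1 & -1 \end{bmatrix}$$

The deviation (column) vectors are $\mathbf{d}_1' = [-2, 1, 1]$, $\mathbf{d}_2' = [1, 0, -1]$, and $\mathbf{d}_3' = [0, 1, -1]$. Since $\mathbf{d}_3 = \mathbf{d}_1 + 2\mathbf{d}_2$, there is column degeneracy. (Note that there is row degeneracy also.) This means that one of the deviation vectors, for example $\mathbf{d}_3$, lies in the plane generated by the other two residual vectors. Consequently, the three-dimensional volume is zero. This case is illustrated in Figure 3.9 and may be verified algebraically by showing that $|\mathbf{S}| = 0$. We have

$$\underset{(3 \times 3)}{\mathbf{S}} = \begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}$$

Figure 3.9 A case where the three-dimensional volume is zero ($|\mathbf{S}| = 0$).

and from Definition 2A.24,

$$|\mathbf{S}| = 3\begin{vmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 \end{vmatrix}(-1)^2 + \left(-\tfrac{3}{2}\right)\begin{vmatrix} -\tfrac{3}{2} & \tfrac{1}{2} \\ 0 & 1 \end{vmatrix}(-1)^3 + (0)\begin{vmatrix} -\tfrac{3}{2} & 1 \\ 0 & \tfrac{1}{2} \end{vmatrix}(-1)^4$$

$$= 3\left(1 - \tfrac{1}{4}\right) + \left(\tfrac{3}{2}\right)\left(-\tfrac{3}{2} - 0\right) + 0 = \tfrac{9}{4} - \tfrac{9}{4} = 0 \qquad\blacksquare$$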
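The calculation in Example 3.9 can be repeated exactly with rational arithmetic. The following sketch recomputes $\mathbf{S}$ from the data matrix and expands the determinant along the first row; `sample_cov` and `det3` are hypothetical helper names.

```python
from fractions import Fraction

def sample_cov(rows):
    """Unbiased sample covariance matrix S (divisor n - 1) from a list of rows."""
    n, p = len(rows), len(rows[0])
    xbar = [Fraction(sum(r[k] for r in rows), n) for k in range(p)]
    dev = [[r[k] - xbar[k] for k in range(p)] for r in rows]
    return [[sum(d[i] * d[k] for d in dev) / (n - 1) for k in range(p)]
            for i in range(p)]

def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Data matrix of Example 3.9.
X = [[1, 2, 5],
     [4, 1, 6],
     [4, 0, 4]]

S = sample_cov(X)
for row in S:
    print([str(v) for v in row])
print("|S| =", det3(S))
```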


When large data sets are sent and received electronically, investigators are 
sometimes unpleasantly surprised to find a case of zero generalized variance, so that 
S does not have an inverse. We have encountered several such cases, with their asso- 
ciated difficulties, before the situation was unmasked. A singular covariance matrix 
occurs when, for instance, the data are test scores and the investigator has included 
variables that are sums of the others. For example, an algebra score and a geometry 
score could be combined to give a total math score, or class midterm and final exam 
scores summed to give total points. Once, the total weight of a number of chemicals 
was included along with that of each component. 

This common practice of creating new variables that are sums of the original variables and then including them in the data set has caused enough lost time that we emphasize the necessity of being alert to avoid these consequences.


Example 3.10 (Creating new variables that lead to a zero generalized variance) Consider the data matrix

$$\mathbf{X} = \begin{bmatrix} 1 & 9 & 10 \\ 4 & 12 & 16 \\ 2 & 10 & 12 \\ 5 & 8 & 13 \\ 3 & 11 & 14 \end{bmatrix}$$

where the third column is the sum of the first two columns. These data could be the number of successful phone solicitations per day by a part-time and a full-time employee, respectively, so the third column is the total number of successful solicitations per day.

Show that the generalized variance $|\mathbf{S}| = 0$, and determine the nature of the dependency in the data.




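We find that the mean corrected data matrix, with entries $x_{jk} - \bar{x}_k$, is

$$\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}' = \begin{bmatrix} -2 & -1 & -3 \\ 1 & 2 & 3 \\ -1 & 0 & -1 \\ 2 & -2 & 0 \\ 0 & 1 & 1 \end{bmatrix}$$

The resulting covariance matrix is

$$\mathbf{S} = \begin{bmatrix} 2.5 & 0 & 2.5 \\ 0 & 2.5 & 2.5 \\ 2.5 & 2.5 & 5.0 \end{bmatrix}$$

We verify that, in this case, the generalized variance

$$|\mathbf{S}| = 2.5^2 \times 5 + 0 + 0 - 2.5^3 - 2.5^3 - 0 = 0$$

In general, if the three columns of the data matrix $\mathbf{X}$ satisfy a linear constraint $a_1x_{j1} + a_2x_{j2} + a_3x_{j3} = c$, a constant for all $j$, then $a_1\bar{x}_1 + a_2\bar{x}_2 + a_3\bar{x}_3 = c$, so that

$$a_1(x_{j1} - \bar{x}_1) + a_2(x_{j2} - \bar{x}_2) + a_3(x_{j3} - \bar{x}_3) = 0$$

for all $j$. That is,

$$(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} = \mathbf{0}$$

and the columns of the mean corrected data matrix are linearly dependent. Thus, the inclusion of the third variable, which is linearly related to the first two, has led to the case of a zero generalized variance.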

Whenever the columns of the mean corrected data matrix are linearly dependent,

$$(n - 1)\mathbf{S}\mathbf{a} = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'\mathbf{0} = \mathbf{0}$$

and $\mathbf{S}\mathbf{a} = \mathbf{0}$ establishes the linear dependency of the columns of $\mathbf{S}$. Hence, $|\mathbf{S}| = 0$.

Since $\mathbf{S}\mathbf{a} = \mathbf{0} = 0\,\mathbf{a}$, we see that $\mathbf{a}$ is a scaled eigenvector of $\mathbf{S}$ associated with an eigenvalue of zero. This gives rise to an important diagnostic: If we are unaware of any extra variables that are linear combinations of the others, we can find them by calculating the eigenvectors of $\mathbf{S}$ and identifying the one associated with a zero eigenvalue. That is, if we were unaware of the dependency in this example, a computer calculation would find an eigenvector proportional to $\mathbf{a}' = [1, 1, -1]$, since

$$\mathbf{S}\mathbf{a} = \begin{bmatrix} 2.5 & 0 & 2.5 \\ 0 & 2.5 & 2.5 \\ 2.5 & 2.5 & 5.0 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} = 0\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}$$

The coefficients reveal that

$$1(x_{j1} - \bar{x}_1) + 1(x_{j2} - \bar{x}_2) + (-1)(x_{j3} - \bar{x}_3) = 0 \quad\text{for all } j$$

In addition, the sum of the first two variables minus the third is a constant $c$ for all $n$ units. Here the third variable is actually the sum of the first two variables, so the columns of the original data matrix satisfy a linear constraint with $c = 0$. Because we have the special case $c = 0$, the constraint establishes the fact that the columns of the data matrix are linearly dependent. ■
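The diagnostic can be imitated without an eigenvalue routine in this particular case. As a shortcut that is valid only for a 3 x 3 matrix of rank 2 (an assumption of this sketch, not the general computer calculation the text refers to), a vector orthogonal to two independent rows of $\mathbf{S}$ spans the eigenspace for the zero eigenvalue, and the cross product of those rows provides one. The helper names are hypothetical.

```python
from fractions import Fraction

def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def cross(u, v):
    """Cross product of two 3-vectors."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

h = Fraction(5, 2)
# Covariance matrix of Example 3.10.
S = [[h, 0, h],
     [0, h, h],
     [h, h, 2 * h]]

# A vector orthogonal to the first two (independent) rows of S.
null_dir = cross(S[0], S[1])
a = [v / null_dir[0] for v in null_dir]   # rescale so the first coefficient is 1
print("a =", [str(v) for v in a])
print("S a =", [str(v) for v in matvec(S, a)])

# The same coefficients applied to the original data give a constant c.
X = [[1, 9, 10], [4, 12, 16], [2, 10, 12], [5, 8, 13], [3, 11, 14]]
constants = [sum(ai * xi for ai, xi in zip(a, row)) for row in X]
print("a'x_j =", [str(v) for v in constants])
```

In practice one would use a numerical eigen-decomposition and look for eigenvalues that are zero up to rounding error.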




Let us summarize the important equivalent conditions for a generalized variance to be zero that we discussed in the preceding example. Whenever a nonzero vector $\mathbf{a}$ satisfies one of the following three conditions, it satisfies all of them:

(1) $\mathbf{S}\mathbf{a} = \mathbf{0}$: $\mathbf{a}$ is a scaled eigenvector of $\mathbf{S}$ with eigenvalue 0.
(2) $\mathbf{a}'(\mathbf{x}_j - \bar{\mathbf{x}}) = 0$ for all $j$: The linear combination of the mean corrected data, using $\mathbf{a}$, is zero.
(3) $\mathbf{a}'\mathbf{x}_j = c$ for all $j$ ($c = \mathbf{a}'\bar{\mathbf{x}}$): The linear combination of the original data, using $\mathbf{a}$, is a constant.

We showed that if condition (3) is satisfied (that is, if the values for one variable can be expressed in terms of the others), then the generalized variance is zero because $\mathbf{S}$ has a zero eigenvalue. In the other direction, if condition (1) holds, then the eigenvector $\mathbf{a}$ gives coefficients for the linear dependency of the mean corrected data.

In any statistical analysis, $|\mathbf{S}| = 0$ means that the measurements on some variables should be removed from the study as far as the mathematical computations are concerned. The corresponding reduced data matrix will then lead to a covariance matrix of full rank and a nonzero generalized variance. The question of which measurements to remove in degenerate cases is not easy to answer. When there is a choice, one should retain measurements on a (presumed) causal variable instead of those on a secondary characteristic. We shall return to this subject in our discussion of principal components.

At this point, we settle for delineating some simple conditions for $\mathbf{S}$ to be of full rank or of reduced rank.


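Result 3.3. If $n \le p$, that is, (sample size) $\le$ (number of variables), then $|\mathbf{S}| = 0$ for all samples.

Proof. We must show that the rank of $\mathbf{S}$ is less than $p$ (that is, at most $p - 1$) and then apply Result 2A.9.

For any fixed sample, the $n$ row vectors in (3-18) sum to the zero vector. The existence of this linear combination means that the rank of $\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}'$ is less than or equal to $n - 1$, which, in turn, is less than or equal to $p - 1$ because $n \le p$. Since

$$(n - 1)\underset{(p \times p)}{\mathbf{S}} = \underset{(p \times n)}{(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'}\ \underset{(n \times p)}{(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')}$$

the $k$th column of $\mathbf{S}$, $\operatorname{col}_k(\mathbf{S})$, can be written as a linear combination of the columns of $(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'$. In particular,

$$(n - 1)\operatorname{col}_k(\mathbf{S}) = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'\operatorname{col}_k(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}') = (x_{1k} - \bar{x}_k)\operatorname{col}_1(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')' + \cdots + (x_{nk} - \bar{x}_k)\operatorname{col}_n(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'$$

Since the column vectors of $(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'$ sum to the zero vector, we can write, for example, $\operatorname{col}_1(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'$ as the negative of the sum of the remaining column vectors. After substituting for $\operatorname{col}_1(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'$ in the preceding equation, we can express $\operatorname{col}_k(\mathbf{S})$ as a linear combination of the at most $n - 1$ linearly independent vectors $\operatorname{col}_2(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')', \ldots, \operatorname{col}_n(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'$. The rank of $\mathbf{S}$ is therefore less than or equal to $n - 1$, which, as noted at the beginning of the proof, is less than or equal to $p - 1$, and $\mathbf{S}$ is singular. This implies, from Result 2A.9, that $|\mathbf{S}| = 0$. ■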
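Result 3.3 can be illustrated on small made-up samples. In the sketch below all three data sets are hypothetical: two have $n \le p$ and must give $|\mathbf{S}| = 0$ exactly, while the third adds a fourth unit so that $n > p$ and the determinant is positive for these particular values. The helpers `sample_cov` and `det` are illustrative names; `det` uses plain cofactor expansion, which is adequate only for small matrices.

```python
from fractions import Fraction

def sample_cov(rows):
    """Unbiased sample covariance matrix S (divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    xbar = [Fraction(sum(r[k] for r in rows), n) for k in range(p)]
    dev = [[r[k] - xbar[k] for k in range(p)] for r in rows]
    return [[sum(d[i] * d[k] for d in dev) / (n - 1) for k in range(p)]
            for i in range(p)]

def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for k, entry in enumerate(m[0]):
        minor = [row[:k] + row[k + 1:] for row in m[1:]]
        total += (-1) ** k * entry * det(minor)
    return total

X_square = [[2, 7, 1], [5, 3, 8], [4, 4, 6]]           # n = 3, p = 3
X_wide = [[1, 0, 2], [3, 5, 1]]                         # n = 2, p = 3
X_tall = [[2, 7, 1], [5, 3, 8], [4, 4, 6], [1, 1, 1]]   # n = 4, p = 3

for name, X in [("n = p", X_square), ("n < p", X_wide), ("n > p", X_tall)]:
    print(name, "|S| =", det(sample_cov(X)))
```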




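Result 3.4. Let the $p \times 1$ vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, where $\mathbf{x}_j'$ is the $j$th row of the data matrix $\mathbf{X}$, be realizations of the independent random vectors $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$. Then

1. If the linear combination $\mathbf{a}'\mathbf{X}_j$ has positive variance for each constant vector $\mathbf{a} \ne \mathbf{0}$, then, provided that $p < n$, $\mathbf{S}$ has full rank with probability 1 and $|\mathbf{S}| > 0$.
2. If, with probability 1, $\mathbf{a}'\mathbf{X}_j$ is a constant (for example, $c$) for all $j$, then $|\mathbf{S}| = 0$.

Proof. (Part 2). If $\mathbf{a}'\mathbf{X}_j = a_1X_{j1} + a_2X_{j2} + \cdots + a_pX_{jp} = c$ with probability 1, $\mathbf{a}'\mathbf{x}_j = c$ for all $j$, and the sample mean of this linear combination is $c = \sum_{j=1}^{n}(a_1x_{j1} + a_2x_{j2} + \cdots + a_px_{jp})/n = a_1\bar{x}_1 + a_2\bar{x}_2 + \cdots + a_p\bar{x}_p = \mathbf{a}'\bar{\mathbf{x}}$. Then

$$(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} = a_1\begin{bmatrix} x_{11} - \bar{x}_1 \\ \vdots \\ x_{n1} - \bar{x}_1 \end{bmatrix} + \cdots + a_p\begin{bmatrix} x_{1p} - \bar{x}_p \\ \vdots \\ x_{np} - \bar{x}_p \end{bmatrix} = \begin{bmatrix} \mathbf{a}'\mathbf{x}_1 - \mathbf{a}'\bar{\mathbf{x}} \\ \vdots \\ \mathbf{a}'\mathbf{x}_n - \mathbf{a}'\bar{\mathbf{x}} \end{bmatrix} = \begin{bmatrix} c - c \\ \vdots \\ c - c \end{bmatrix} = \mathbf{0}$$

indicating linear dependence; the conclusion follows from Result 3.2.
The proof of Part (1) is difficult and can be found in [2]. ■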


Generalized Variance Determined by |R| 
and Its Geometrical Interpretation 


The generalized sample variance is unduly affected by the variability of measurements on a single variable. For example, suppose some $s_{ii}$ is either large or quite small. Then, geometrically, the corresponding deviation vector $\mathbf{d}_i = (\mathbf{y}_i - \bar{x}_i\mathbf{1})$ will be very long or very short and will therefore clearly be an important factor in determining volume. Consequently, it is sometimes useful to scale all the deviation vectors so that they have the same length.

Scaling the residual vectors is equivalent to replacing each original observation $x_{jk}$ by its standardized value $(x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$. The sample covariance matrix of the standardized variables is then $\mathbf{R}$, the sample correlation matrix of the original variables. (See Exercise 3.13.) We define

$$\left(\begin{array}{c}\text{Generalized sample variance}\\ \text{of the standardized variables}\end{array}\right) = |\mathbf{R}| \tag{3-19}$$

Since the resulting vectors

$$[(x_{1k} - \bar{x}_k)/\sqrt{s_{kk}},\ (x_{2k} - \bar{x}_k)/\sqrt{s_{kk}},\ \ldots,\ (x_{nk} - \bar{x}_k)/\sqrt{s_{kk}}] = (\mathbf{y}_k - \bar{x}_k\mathbf{1})'/\sqrt{s_{kk}}$$

all have length $\sqrt{n - 1}$, the generalized sample variance of the standardized variables will be large when these vectors are nearly perpendicular and will be small when two or more of these vectors are in almost the same direction. Employing the argument leading to (3-7), we readily find that the cosine of the angle $\theta_{ik}$ between $(\mathbf{y}_i - \bar{x}_i\mathbf{1})/\sqrt{s_{ii}}$ and $(\mathbf{y}_k - \bar{x}_k\mathbf{1})/\sqrt{s_{kk}}$ is the sample correlation coefficient $r_{ik}$. Therefore, we can make the statement that $|\mathbf{R}|$ is large when all the $r_{ik}$ are nearly zero and it is small when one or more of the $r_{ik}$ are nearly $+1$ or $-1$.

In sum, we have the following result: Let

$$\frac{(\mathbf{y}_i - \bar{x}_i\mathbf{1})}{\sqrt{s_{ii}}} = \begin{bmatrix} \dfrac{x_{1i} - \bar{x}_i}{\sqrt{s_{ii}}} \\ \dfrac{x_{2i} - \bar{x}_i}{\sqrt{s_{ii}}} \\ \vdots \\ \dfrac{x_{ni} - \bar{x}_i}{\sqrt{s_{ii}}} \end{bmatrix}, \quad i = 1, 2, \ldots, p$$

be the deviation vectors of the standardized variables. The $i$th deviation vectors lie in the direction of $\mathbf{d}_i$, but all have a squared length of $n - 1$. The volume generated in $p$-space by the deviation vectors can be related to the generalized sample variance. The same steps that lead to (3-15) produce

$$\left(\begin{array}{c}\text{Generalized sample variance}\\ \text{of the standardized variables}\end{array}\right) = |\mathbf{R}| = (n - 1)^{-p}(\text{volume})^2 \tag{3-20}$$

The volume generated by deviation vectors of the standardized variables is illustrated in Figure 3.10 for the two sets of deviation vectors graphed in Figure 3.6. A comparison of Figures 3.10 and 3.6 reveals that the influence of the $\mathbf{d}_2$ vector (large variability in $x_2$) on the squared volume $|\mathbf{S}|$ is much greater than its influence on the squared volume $|\mathbf{R}|$.

Figure 3.10 The volume generated by equal-length deviation vectors of the standardized variables. Panels (a) and (b).




The quantities $|\mathbf{S}|$ and $|\mathbf{R}|$ are connected by the relationship

$$|\mathbf{S}| = (s_{11}s_{22}\cdots s_{pp})|\mathbf{R}| \tag{3-21}$$

so

$$(n - 1)^p|\mathbf{S}| = (n - 1)^p(s_{11}s_{22}\cdots s_{pp})|\mathbf{R}| \tag{3-22}$$

[The proof of (3-21) is left to the reader as Exercise 3.12.]

Interpreting (3-22) in terms of volumes, we see from (3-15) and (3-20) that the squared volume $(n - 1)^p|\mathbf{S}|$ is proportional to the squared volume $(n - 1)^p|\mathbf{R}|$. The constant of proportionality is the product of the variances, which, in turn, is proportional to the product of the squares of the lengths $(n - 1)s_{ii}$ of the $\mathbf{d}_i$. Equation (3-21) shows, algebraically, how a change in the measurement scale of $x_1$, for example, will alter the relationship between the generalized variances. Since $|\mathbf{R}|$ is based on standardized measurements, it is unaffected by the change in scale. However, the relative value of $|\mathbf{S}|$ will be changed whenever the multiplicative factor $s_{11}$ changes.


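Example 3.11 (Illustrating the relation between $|\mathbf{S}|$ and $|\mathbf{R}|$) Let us illustrate the relationship in (3-21) for the generalized variances $|\mathbf{S}|$ and $|\mathbf{R}|$ when $p = 3$. Suppose

$$\underset{(3 \times 3)}{\mathbf{S}} = \begin{bmatrix} 4 & 3 & 1 \\ 3 & 9 & 2 \\ 1 & 2 & 1 \end{bmatrix}$$

Then $s_{11} = 4$, $s_{22} = 9$, and $s_{33} = 1$. Moreover,

$$\mathbf{R} = \begin{bmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 & \tfrac{2}{3} \\ \tfrac{1}{2} & \tfrac{2}{3} & 1 \end{bmatrix}$$

Using Definition 2A.24, we obtain

$$|\mathbf{S}| = 4\begin{vmatrix} 9 & 2 \\ 2 & 1 \end{vmatrix}(-1)^2 + 3\begin{vmatrix} 3 & 2 \\ 1 & 1 \end{vmatrix}(-1)^3 + 1\begin{vmatrix} 3 & 9 \\ 1 & 2 \end{vmatrix}(-1)^4 = 4(9 - 4) - 3(3 - 2) + 1(6 - 9) = 14$$

$$|\mathbf{R}| = 1\begin{vmatrix} 1 & \tfrac{2}{3} \\ \tfrac{2}{3} & 1 \end{vmatrix}(-1)^2 + \tfrac{1}{2}\begin{vmatrix} \tfrac{1}{2} & \tfrac{2}{3} \\ \tfrac{1}{2} & 1 \end{vmatrix}(-1)^3 + \tfrac{1}{2}\begin{vmatrix} \tfrac{1}{2} & 1 \\ \tfrac{1}{2} & \tfrac{2}{3} \end{vmatrix}(-1)^4 = \left(1 - \tfrac{4}{9}\right) - \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2} - \tfrac{1}{3}\right) + \left(\tfrac{1}{2}\right)\left(\tfrac{1}{3} - \tfrac{1}{2}\right) = \tfrac{7}{18}$$

It then follows that

$$14 = |\mathbf{S}| = s_{11}s_{22}s_{33}|\mathbf{R}| = (4)(9)(1)\left(\tfrac{7}{18}\right) = 14 \quad\text{(check)} \qquad\blacksquare$$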
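The check in Example 3.11 can be repeated with exact fractions. In this sketch the correlation matrix is entered directly from the example rather than derived, and `det3` is a hypothetical helper.

```python
from fractions import Fraction as F

def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# S and R as given in Example 3.11.
S = [[4, 3, 1],
     [3, 9, 2],
     [1, 2, 1]]
R = [[F(1), F(1, 2), F(1, 2)],
     [F(1, 2), F(1), F(2, 3)],
     [F(1, 2), F(2, 3), F(1)]]

det_S = det3(S)
det_R = det3(R)
product_of_variances = S[0][0] * S[1][1] * S[2][2]

print("|S| =", det_S)
print("|R| =", det_R)
print("s11*s22*s33*|R| =", product_of_variances * det_R)   # relation (3-21)
```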




Another Generalization of Variance 


We conclude this discussion by mentioning another generalization of variance. Specifically, we define the total sample variance as the sum of the diagonal elements of the sample variance-covariance matrix $\mathbf{S}$. Thus,

$$\text{Total sample variance} = s_{11} + s_{22} + \cdots + s_{pp} \tag{3-23}$$

Example 3.12 (Calculating the total sample variance) Calculate the total sample variance for the variance-covariance matrices $\mathbf{S}$ in Examples 3.7 and 3.9.
From Example 3.7,

$$\mathbf{S} = \begin{bmatrix} 252.04 & -68.43 \\ -68.43 & 123.67 \end{bmatrix}$$

and

$$\text{Total sample variance} = s_{11} + s_{22} = 252.04 + 123.67 = 375.71$$

From Example 3.9,

$$\mathbf{S} = \begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}$$

and

$$\text{Total sample variance} = s_{11} + s_{22} + s_{33} = 3 + 1 + 1 = 5 \qquad\blacksquare$$

Geometrically, the total sample variance is the sum of the squared lengths of the $p$ deviation vectors $\mathbf{d}_1 = (\mathbf{y}_1 - \bar{x}_1\mathbf{1}), \ldots, \mathbf{d}_p = (\mathbf{y}_p - \bar{x}_p\mathbf{1})$, divided by $n - 1$. The total sample variance criterion pays no attention to the orientation (correlation structure) of the residual vectors. For instance, it assigns the same values to both sets of residual vectors (a) and (b) in Figure 3.6.
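Since the total sample variance is simply the trace of $\mathbf{S}$, it takes one line to compute. The sketch below (with a hypothetical helper name) reproduces the two totals from Example 3.12 and also shows that matrices with opposite correlation signs receive the same total, which is the insensitivity to orientation described above.

```python
def total_sample_variance(S):
    """Sum of the diagonal elements of S, as in (3-23)."""
    return sum(S[i][i] for i in range(len(S)))

S_37 = [[252.04, -68.43], [-68.43, 123.67]]          # Example 3.7
S_39 = [[3, -1.5, 0], [-1.5, 1, 0.5], [0, 0.5, 1]]   # Example 3.9

total_37 = total_sample_variance(S_37)
total_39 = total_sample_variance(S_39)
print(round(total_37, 2), total_39)

# Covariance matrices from Example 3.8 with r = .8 and r = -.8:
# same diagonal, so the same total, despite opposite orientations.
total_pos = total_sample_variance([[5, 4], [4, 5]])
total_neg = total_sample_variance([[5, -4], [-4, 5]])
print(total_pos, total_neg)
```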


3.5 Sample Mean, Covariance, and Correlation 
as Matrix Operations 


We have developed geometrical representations of the data matrix $\mathbf{X}$ and the derived descriptive statistics $\bar{\mathbf{x}}$ and $\mathbf{S}$. In addition, it is possible to link algebraically the calculation of $\bar{\mathbf{x}}$ and $\mathbf{S}$ directly to $\mathbf{X}$ using matrix operations. The resulting expressions, which depict the relation between $\bar{\mathbf{x}}$, $\mathbf{S}$, and the full data set $\mathbf{X}$ concisely, are easily programmed on electronic computers.




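We have it that $\bar{x}_i = (x_{1i}\cdot 1 + x_{2i}\cdot 1 + \cdots + x_{ni}\cdot 1)/n = \mathbf{y}_i'\mathbf{1}/n$. Therefore,

$$\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix} = \begin{bmatrix} \dfrac{\mathbf{y}_1'\mathbf{1}}{n} \\ \dfrac{\mathbf{y}_2'\mathbf{1}}{n} \\ \vdots \\ \dfrac{\mathbf{y}_p'\mathbf{1}}{n} \end{bmatrix} = \frac{1}{n}\begin{bmatrix} x_{11} & x_{21} & \cdots & x_{n1} \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{np} \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$

or

$$\bar{\mathbf{x}} = \frac{1}{n}\mathbf{X}'\mathbf{1} \tag{3-24}$$

That is, $\bar{\mathbf{x}}$ is calculated from the transposed data matrix by postmultiplying by the vector $\mathbf{1}$ and then multiplying the result by the constant $1/n$.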

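Next, we create an $n \times p$ matrix of means by transposing both sides of (3-24) and premultiplying by $\mathbf{1}$; that is,

$$\mathbf{1}\bar{\mathbf{x}}' = \frac{1}{n}\mathbf{1}\mathbf{1}'\mathbf{X} = \begin{bmatrix} \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \\ \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \\ \vdots & \vdots & \ddots & \vdots \\ \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \end{bmatrix} \tag{3-25}$$

Subtracting this result from $\mathbf{X}$ produces the $n \times p$ matrix of deviations (residuals)

$$\mathbf{X} - \frac{1}{n}\mathbf{1}\mathbf{1}'\mathbf{X} = \begin{bmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1p} - \bar{x}_p \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2p} - \bar{x}_p \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \cdots & x_{np} - \bar{x}_p \end{bmatrix} \tag{3-26}$$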


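Now, the matrix $(n - 1)\mathbf{S}$ representing sums of squares and cross products is just the transpose of the matrix (3-26) times the matrix itself, or

$$(n - 1)\mathbf{S} = \begin{bmatrix} x_{11} - \bar{x}_1 & x_{21} - \bar{x}_1 & \cdots & x_{n1} - \bar{x}_1 \\ x_{12} - \bar{x}_2 & x_{22} - \bar{x}_2 & \cdots & x_{n2} - \bar{x}_2 \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} - \bar{x}_p & x_{2p} - \bar{x}_p & \cdots & x_{np} - \bar{x}_p \end{bmatrix} \times \begin{bmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1p} - \bar{x}_p \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2p} - \bar{x}_p \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \cdots & x_{np} - \bar{x}_p \end{bmatrix}$$

$$= \left(\mathbf{X} - \frac{1}{n}\mathbf{1}\mathbf{1}'\mathbf{X}\right)'\left(\mathbf{X} - \frac{1}{n}\mathbf{1}\mathbf{1}'\mathbf{X}\right) = \mathbf{X}'\left(\mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)\mathbf{X}$$

since

$$\left(\mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)'\left(\mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}'\right) = \mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}' - \frac{1}{n}\mathbf{1}\mathbf{1}' + \frac{1}{n^2}\mathbf{1}\mathbf{1}'\mathbf{1}\mathbf{1}' = \mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}'$$

To summarize, the matrix expressions relating $\bar{\mathbf{x}}$ and $\mathbf{S}$ to the data set $\mathbf{X}$ are

$$\bar{\mathbf{x}} = \frac{1}{n}\mathbf{X}'\mathbf{1}$$

$$\mathbf{S} = \frac{1}{n-1}\mathbf{X}'\left(\mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)\mathbf{X} \tag{3-27}$$

The result for $\mathbf{S}_n$ is similar, except that $1/n$ replaces $1/(n - 1)$ as the first factor.
The relations in (3-27) show clearly how matrix operations on the data matrix $\mathbf{X}$ lead to $\bar{\mathbf{x}}$ and $\mathbf{S}$.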
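The expressions in (3-27) translate almost literally into code. The following sketch uses plain nested lists and exact fractions rather than a matrix library; the helper names (`transpose`, `matmul`, `mean_and_cov`) are hypothetical, and the data are those of Example 3.9 so that the output can be compared with the values given there.

```python
from fractions import Fraction

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mean_and_cov(X):
    """Return (xbar, S) computed from the matrix expressions in (3-27)."""
    n = len(X)
    ones = [[Fraction(1)] for _ in range(n)]                 # the n x 1 vector 1
    Xt = transpose(X)
    xbar = [[v / n for v in row] for row in matmul(Xt, ones)]  # (1/n) X'1, p x 1
    # Centering matrix I - (1/n) 1 1'
    C = [[(1 if i == j else 0) - Fraction(1, n) for j in range(n)]
         for i in range(n)]
    S = [[v / (n - 1) for v in row] for row in matmul(matmul(Xt, C), X)]
    return xbar, S

X = [[1, 2, 5],
     [4, 1, 6],
     [4, 0, 4]]
xbar, S = mean_and_cov(X)
print("xbar' =", [str(row[0]) for row in xbar])
for row in S:
    print([str(v) for v in row])
```

Forming the n x n centering matrix explicitly is convenient for checking the algebra, but for large n it is cheaper to subtract the column means directly, as in (3-26).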
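Once $\mathbf{S}$ is computed, it can be related to the sample correlation matrix $\mathbf{R}$. The resulting expression can also be "inverted" to relate $\mathbf{R}$ to $\mathbf{S}$. We first define the $p \times p$ sample standard deviation matrix $\mathbf{D}^{1/2}$ and compute its inverse, $(\mathbf{D}^{1/2})^{-1} = \mathbf{D}^{-1/2}$. Let

$$\underset{(p \times p)}{\mathbf{D}^{1/2}} = \begin{bmatrix} \sqrt{s_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{s_{22}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{s_{pp}} \end{bmatrix} \tag{3-28}$$

Then

$$\underset{(p \times p)}{\mathbf{D}^{-1/2}} = \begin{bmatrix} \dfrac{1}{\sqrt{s_{11}}} & 0 & \cdots & 0 \\ 0 & \dfrac{1}{\sqrt{s_{22}}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \dfrac{1}{\sqrt{s_{pp}}} \end{bmatrix}$$

Since

$$\mathbf{S} = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{bmatrix}$$

and

$$\mathbf{R} = \begin{bmatrix} \dfrac{s_{11}}{\sqrt{s_{11}}\sqrt{s_{11}}} & \dfrac{s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} & \cdots & \dfrac{s_{1p}}{\sqrt{s_{11}}\sqrt{s_{pp}}} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{s_{1p}}{\sqrt{s_{11}}\sqrt{s_{pp}}} & \dfrac{s_{2p}}{\sqrt{s_{22}}\sqrt{s_{pp}}} & \cdots & \dfrac{s_{pp}}{\sqrt{s_{pp}}\sqrt{s_{pp}}} \end{bmatrix} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1p} & r_{2p} & \cdots & 1 \end{bmatrix}$$

we have

$$\mathbf{R} = \mathbf{D}^{-1/2}\mathbf{S}\mathbf{D}^{-1/2} \tag{3-29}$$

Postmultiplying and premultiplying both sides of (3-29) by $\mathbf{D}^{1/2}$ and noting that $\mathbf{D}^{-1/2}\mathbf{D}^{1/2} = \mathbf{D}^{1/2}\mathbf{D}^{-1/2} = \mathbf{I}$ gives

$$\mathbf{S} = \mathbf{D}^{1/2}\mathbf{R}\mathbf{D}^{1/2} \tag{3-30}$$

That is, $\mathbf{R}$ can be obtained from the information in $\mathbf{S}$, whereas $\mathbf{S}$ can be obtained from $\mathbf{D}^{1/2}$ and $\mathbf{R}$. Equations (3-29) and (3-30) are sample analogs of (2-36) and (2-37).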
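Because $\mathbf{D}^{1/2}$ is diagonal, (3-29) and (3-30) amount to rescaling each entry of the matrix by two standard deviations. The sketch below avoids forming the diagonal matrices explicitly; the function names are hypothetical, and the covariance matrix is the one from Example 3.11.

```python
import math

def correlation_from_cov(S):
    """R = D^{-1/2} S D^{-1/2}, as in (3-29), computed entry by entry."""
    p = len(S)
    inv_sd = [1 / math.sqrt(S[i][i]) for i in range(p)]
    return [[inv_sd[i] * S[i][k] * inv_sd[k] for k in range(p)] for i in range(p)]

def cov_from_correlation(R, sd):
    """S = D^{1/2} R D^{1/2}, as in (3-30); sd holds the sample standard deviations."""
    p = len(R)
    return [[sd[i] * R[i][k] * sd[k] for k in range(p)] for i in range(p)]

S = [[4, 3, 1],
     [3, 9, 2],
     [1, 2, 1]]
R = correlation_from_cov(S)
for row in R:
    print([round(v, 4) for v in row])

# Round trip back to S using the standard deviations sqrt(s_ii).
sd = [math.sqrt(S[i][i]) for i in range(len(S))]
S_back = cov_from_correlation(R, sd)
```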


3.6 Sample Values of Linear Combinations of Variables 


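We have introduced linear combinations of $p$ variables in Section 2.6. In many multivariate procedures, we are led naturally to consider a linear combination of the form

$$\mathbf{c}'\mathbf{X} = c_1X_1 + c_2X_2 + \cdots + c_pX_p$$

whose observed value on the $j$th trial is

$$\mathbf{c}'\mathbf{x}_j = c_1x_{j1} + c_2x_{j2} + \cdots + c_px_{jp}, \quad j = 1, 2, \ldots, n \tag{3-31}$$

The $n$ derived observations in (3-31) have

$$\text{Sample mean} = \frac{(\mathbf{c}'\mathbf{x}_1 + \mathbf{c}'\mathbf{x}_2 + \cdots + \mathbf{c}'\mathbf{x}_n)}{n} = \mathbf{c}'(\mathbf{x}_1 + \mathbf{x}_2 + \cdots + \mathbf{x}_n)\frac{1}{n} = \mathbf{c}'\bar{\mathbf{x}} \tag{3-32}$$

Since $(\mathbf{c}'\mathbf{x}_j - \mathbf{c}'\bar{\mathbf{x}})^2 = (\mathbf{c}'(\mathbf{x}_j - \bar{\mathbf{x}}))^2 = \mathbf{c}'(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{c}$, we have

$$\text{Sample variance} = \frac{(\mathbf{c}'\mathbf{x}_1 - \mathbf{c}'\bar{\mathbf{x}})^2 + (\mathbf{c}'\mathbf{x}_2 - \mathbf{c}'\bar{\mathbf{x}})^2 + \cdots + (\mathbf{c}'\mathbf{x}_n - \mathbf{c}'\bar{\mathbf{x}})^2}{n - 1}$$

$$= \frac{\mathbf{c}'(\mathbf{x}_1 - \bar{\mathbf{x}})(\mathbf{x}_1 - \bar{\mathbf{x}})'\mathbf{c} + \mathbf{c}'(\mathbf{x}_2 - \bar{\mathbf{x}})(\mathbf{x}_2 - \bar{\mathbf{x}})'\mathbf{c} + \cdots + \mathbf{c}'(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})'\mathbf{c}}{n - 1}$$

$$= \mathbf{c}'\left[\frac{(\mathbf{x}_1 - \bar{\mathbf{x}})(\mathbf{x}_1 - \bar{\mathbf{x}})' + (\mathbf{x}_2 - \bar{\mathbf{x}})(\mathbf{x}_2 - \bar{\mathbf{x}})' + \cdots + (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})'}{n - 1}\right]\mathbf{c}$$

or

$$\text{Sample variance of } \mathbf{c}'\mathbf{X} = \mathbf{c}'\mathbf{S}\mathbf{c} \tag{3-33}$$

Equations (3-32) and (3-33) are sample analogs of (2-43). They correspond to substituting the sample quantities $\bar{\mathbf{x}}$ and $\mathbf{S}$ for the "population" quantities $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, respectively, in (2-43).

Now consider a second linear combination

$$\mathbf{b}'\mathbf{X} = b_1X_1 + b_2X_2 + \cdots + b_pX_p$$

whose observed value on the $j$th trial is

$$\mathbf{b}'\mathbf{x}_j = b_1x_{j1} + b_2x_{j2} + \cdots + b_px_{jp}, \quad j = 1, 2, \ldots, n \tag{3-34}$$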




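It follows from (3-32) and (3-33) that the sample mean and variance of these derived observations are

$$\text{Sample mean of } \mathbf{b}'\mathbf{X} = \mathbf{b}'\bar{\mathbf{x}}$$

$$\text{Sample variance of } \mathbf{b}'\mathbf{X} = \mathbf{b}'\mathbf{S}\mathbf{b}$$

Moreover, the sample covariance computed from pairs of observations on $\mathbf{b}'\mathbf{X}$ and $\mathbf{c}'\mathbf{X}$ is

$$\text{Sample covariance} = \frac{(\mathbf{b}'\mathbf{x}_1 - \mathbf{b}'\bar{\mathbf{x}})(\mathbf{c}'\mathbf{x}_1 - \mathbf{c}'\bar{\mathbf{x}}) + (\mathbf{b}'\mathbf{x}_2 - \mathbf{b}'\bar{\mathbf{x}})(\mathbf{c}'\mathbf{x}_2 - \mathbf{c}'\bar{\mathbf{x}}) + \cdots + (\mathbf{b}'\mathbf{x}_n - \mathbf{b}'\bar{\mathbf{x}})(\mathbf{c}'\mathbf{x}_n - \mathbf{c}'\bar{\mathbf{x}})}{n - 1}$$

$$= \frac{\mathbf{b}'(\mathbf{x}_1 - \bar{\mathbf{x}})(\mathbf{x}_1 - \bar{\mathbf{x}})'\mathbf{c} + \mathbf{b}'(\mathbf{x}_2 - \bar{\mathbf{x}})(\mathbf{x}_2 - \bar{\mathbf{x}})'\mathbf{c} + \cdots + \mathbf{b}'(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})'\mathbf{c}}{n - 1}$$

$$= \mathbf{b}'\left[\frac{(\mathbf{x}_1 - \bar{\mathbf{x}})(\mathbf{x}_1 - \bar{\mathbf{x}})' + (\mathbf{x}_2 - \bar{\mathbf{x}})(\mathbf{x}_2 - \bar{\mathbf{x}})' + \cdots + (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})'}{n - 1}\right]\mathbf{c}$$

or

$$\text{Sample covariance of } \mathbf{b}'\mathbf{X} \text{ and } \mathbf{c}'\mathbf{X} = \mathbf{b}'\mathbf{S}\mathbf{c} \tag{3-35}$$

In sum, we have the following result.

Result 3.5. The linear combinations

$$\mathbf{b}'\mathbf{X} = b_1X_1 + b_2X_2 + \cdots + b_pX_p$$

$$\mathbf{c}'\mathbf{X} = c_1X_1 + c_2X_2 + \cdots + c_pX_p$$

have sample means, variances, and covariances that are related to $\bar{\mathbf{x}}$ and $\mathbf{S}$ by

$$\begin{aligned} \text{Sample mean of } \mathbf{b}'\mathbf{X} &= \mathbf{b}'\bar{\mathbf{x}} \\ \text{Sample mean of } \mathbf{c}'\mathbf{X} &= \mathbf{c}'\bar{\mathbf{x}} \\ \text{Sample variance of } \mathbf{b}'\mathbf{X} &= \mathbf{b}'\mathbf{S}\mathbf{b} \\ \text{Sample variance of } \mathbf{c}'\mathbf{X} &= \mathbf{c}'\mathbf{S}\mathbf{c} \\ \text{Sample covariance of } \mathbf{b}'\mathbf{X} \text{ and } \mathbf{c}'\mathbf{X} &= \mathbf{b}'\mathbf{S}\mathbf{c} \end{aligned} \tag{3-36}$$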


Example 3.13 (Means and covariances for linear combinations) We shall consider 
two linear combinations and their derived values for the n = 3 observations given 


in Example 3.9 as 


%11 X12 X13 1 2 5 
xX =) X%., X22 423) = 416 
X31; X32 1433 40 4 


Consider the two linear combinations 


b’X =[2 2 —1]| X, | = 2X, + 2X, — X;3 


{42 Chapter 3 Sample Geometry and Random Sampling 
and 


eX=[1 -1 3]) 1) =X -— 1 + 3% 


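The means, variances, and covariance will first be evaluated directly and then be
evaluated by (3-36).

Observations on these linear combinations are obtained by replacing $X_1$, $X_2$,
and $X_3$ with their observed values. For example, the n = 3 observations on $\mathbf{b}'\mathbf{X}$ are

$$\begin{aligned}
\mathbf{b}'\mathbf{x}_1 &= 2x_{11} + 2x_{12} - x_{13} = 2(1) + 2(2) - (5) = 1 \\
\mathbf{b}'\mathbf{x}_2 &= 2x_{21} + 2x_{22} - x_{23} = 2(4) + 2(1) - (6) = 4 \\
\mathbf{b}'\mathbf{x}_3 &= 2x_{31} + 2x_{32} - x_{33} = 2(4) + 2(0) - (4) = 4
\end{aligned}$$

The sample mean and variance of these values are, respectively,

$$\text{Sample mean} = \frac{1 + 4 + 4}{3} = 3$$

$$\text{Sample variance} = \frac{(1 - 3)^2 + (4 - 3)^2 + (4 - 3)^2}{3 - 1} = 3$$

In a similar manner, the n = 3 observations on $\mathbf{c}'\mathbf{X}$ are

$$\begin{aligned}
\mathbf{c}'\mathbf{x}_1 &= 1x_{11} - 1x_{12} + 3x_{13} = 1(1) - 1(2) + 3(5) = 14 \\
\mathbf{c}'\mathbf{x}_2 &= 1(4) - 1(1) + 3(6) = 21 \\
\mathbf{c}'\mathbf{x}_3 &= 1(4) - 1(0) + 3(4) = 16
\end{aligned}$$

and

$$\text{Sample mean} = \frac{14 + 21 + 16}{3} = 17$$

$$\text{Sample variance} = \frac{(14 - 17)^2 + (21 - 17)^2 + (16 - 17)^2}{3 - 1} = 13$$

Moreover, the sample covariance, computed from the pairs of observations
$(\mathbf{b}'\mathbf{x}_1, \mathbf{c}'\mathbf{x}_1)$, $(\mathbf{b}'\mathbf{x}_2, \mathbf{c}'\mathbf{x}_2)$, and $(\mathbf{b}'\mathbf{x}_3, \mathbf{c}'\mathbf{x}_3)$, is

$$\text{Sample covariance} = \frac{(1 - 3)(14 - 17) + (4 - 3)(21 - 17) + (4 - 3)(16 - 17)}{3 - 1} = \frac{9}{2}$$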


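Alternatively, we use the sample mean vector $\bar{\mathbf{x}}$ and sample covariance matrix $\mathbf{S}$
derived from the original data matrix $\mathbf{X}$ to calculate the sample means, variances,
and covariances for the linear combinations. Thus, if only the descriptive statistics
are of interest, we do not even need to calculate the observations $\mathbf{b}'\mathbf{x}_j$ and $\mathbf{c}'\mathbf{x}_j$.

From Example 3.9,

$$\bar{\mathbf{x}} = \begin{bmatrix} 3 \\ 1 \\ 5 \end{bmatrix} \quad \text{and} \quad \mathbf{S} = \begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}$$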




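Consequently, using (3-36), we find that the two sample means for the derived
observations are

$$\text{Sample mean of } \mathbf{b}'\mathbf{X} = \mathbf{b}'\bar{\mathbf{x}} = \begin{bmatrix} 2 & 2 & -1 \end{bmatrix}\begin{bmatrix} 3 \\ 1 \\ 5 \end{bmatrix} = 3 \quad \text{(check)}$$

$$\text{Sample mean of } \mathbf{c}'\mathbf{X} = \mathbf{c}'\bar{\mathbf{x}} = \begin{bmatrix} 1 & -1 & 3 \end{bmatrix}\begin{bmatrix} 3 \\ 1 \\ 5 \end{bmatrix} = 17 \quad \text{(check)}$$

Using (3-36), we also have

$$\begin{aligned}
\text{Sample variance of } \mathbf{b}'\mathbf{X} = \mathbf{b}'\mathbf{S}\mathbf{b}
&= \begin{bmatrix} 2 & 2 & -1 \end{bmatrix}\begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}\begin{bmatrix} 2 \\ 2 \\ -1 \end{bmatrix} \\
&= \begin{bmatrix} 2 & 2 & -1 \end{bmatrix}\begin{bmatrix} 3 \\ -\tfrac{3}{2} \\ 0 \end{bmatrix} = 3 \quad \text{(check)}
\end{aligned}$$

$$\begin{aligned}
\text{Sample variance of } \mathbf{c}'\mathbf{X} = \mathbf{c}'\mathbf{S}\mathbf{c}
&= \begin{bmatrix} 1 & -1 & 3 \end{bmatrix}\begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \\ 3 \end{bmatrix} \\
&= \begin{bmatrix} 1 & -1 & 3 \end{bmatrix}\begin{bmatrix} \tfrac{9}{2} \\ -1 \\ \tfrac{5}{2} \end{bmatrix} = 13 \quad \text{(check)}
\end{aligned}$$

$$\begin{aligned}
\text{Sample covariance of } \mathbf{b}'\mathbf{X} \text{ and } \mathbf{c}'\mathbf{X} = \mathbf{b}'\mathbf{S}\mathbf{c}
&= \begin{bmatrix} 2 & 2 & -1 \end{bmatrix}\begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \\ 3 \end{bmatrix} \\
&= \begin{bmatrix} 2 & 2 & -1 \end{bmatrix}\begin{bmatrix} \tfrac{9}{2} \\ -1 \\ \tfrac{5}{2} \end{bmatrix} = \frac{9}{2} \quad \text{(check)}
\end{aligned}$$

As indicated, these last results check with the corresponding sample quantities
computed directly from the observations on the linear combinations. ■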
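The two routes of Example 3.13 can also be compared numerically. The following is a minimal sketch, assuming NumPy is available; the variable names (`direct`, `from_summary`) are illustrative and not part of the text.

```python
import numpy as np

# Data matrix from Example 3.9: rows are observations, columns are variables.
X = np.array([[1.0, 2.0, 5.0],
              [4.0, 1.0, 6.0],
              [4.0, 0.0, 4.0]])
b = np.array([2.0, 2.0, -1.0])
c = np.array([1.0, -1.0, 3.0])

# Route 1: form the derived observations, then summarize them directly.
bx = X @ b   # observed values of b'X
cx = X @ c   # observed values of c'X
direct = {
    "mean_b": bx.mean(),
    "mean_c": cx.mean(),
    "var_b": bx.var(ddof=1),          # divisor n - 1
    "var_c": cx.var(ddof=1),
    "cov_bc": np.cov(bx, cx)[0, 1],   # np.cov uses divisor n - 1 by default
}

# Route 2: use only the summary statistics, as in (3-36).
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)           # sample covariance matrix, divisor n - 1
from_summary = {
    "mean_b": b @ xbar,
    "mean_c": c @ xbar,
    "var_b": b @ S @ b,
    "var_c": c @ S @ c,
    "cov_bc": b @ S @ c,
}

for name in direct:
    print(name, direct[name], from_summary[name])
```

Both dictionaries should agree entry by entry, which is the content of Result 3.5 for this data set.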


The sample mean and covariance relations in Result 3.5 pertain to any number
of linear combinations. Consider the $q$ linear combinations

$$a_{i1}X_1 + a_{i2}X_2 + \cdots + a_{ip}X_p, \qquad i = 1, 2, \ldots, q \tag{3-37}$$


These can be expressed in matrix notation as

$$\begin{bmatrix} a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p \\ a_{21}X_1 + a_{22}X_2 + \cdots + a_{2p}X_p \\ \vdots \\ a_{q1}X_1 + a_{q2}X_2 + \cdots + a_{qp}X_p \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{q1} & a_{q2} & \cdots & a_{qp} \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix} = \mathbf{A}\mathbf{X} \tag{3-38}$$

Taking the $i$th row of $\mathbf{A}$, $\mathbf{a}_i'$, to be $\mathbf{b}'$ and the $k$th row of $\mathbf{A}$, $\mathbf{a}_k'$, to be $\mathbf{c}'$, we see that
Equations (3-36) imply that the $i$th row of $\mathbf{A}\mathbf{X}$ has sample mean $\mathbf{a}_i'\bar{\mathbf{x}}$ and the $i$th and
$k$th rows of $\mathbf{A}\mathbf{X}$ have sample covariance $\mathbf{a}_i'\mathbf{S}\mathbf{a}_k$. Note that $\mathbf{a}_i'\mathbf{S}\mathbf{a}_k$ is the $(i, k)$th
element of $\mathbf{A}\mathbf{S}\mathbf{A}'$.

Result 3.6. The $q$ linear combinations $\mathbf{A}\mathbf{X}$ in (3-38) have sample mean vector $\mathbf{A}\bar{\mathbf{x}}$
and sample covariance matrix $\mathbf{A}\mathbf{S}\mathbf{A}'$. ■


Exercises


3.1. Given the data matrix

$$\mathbf{X} = \begin{bmatrix} 9 & 1 \\ 5 & 3 \\ 4 & 2 \end{bmatrix}$$

(a) Graph the scatter plot in p = 2 dimensions. Locate the sample mean on your diagram.

(b) Sketch the n = 3-dimensional representation of the data, and plot the deviation
vectors $\mathbf{y}_1 - \bar{x}_1\mathbf{1}$ and $\mathbf{y}_2 - \bar{x}_2\mathbf{1}$.

(c) Sketch the deviation vectors in (b) emanating from the origin. Calculate the lengths
of these vectors and the cosine of the angle between them. Relate these quantities to
$\mathbf{S}_n$ and $\mathbf{R}$.

3.2. Given the data matrix

$$\mathbf{X} = [\;\ldots\;]$$

(a) Graph the scatter plot in p = 2 dimensions, and locate the sample mean on your diagram.

(b) Sketch the n = 3-space representation of the data, and plot the deviation vectors
$\mathbf{y}_1 - \bar{x}_1\mathbf{1}$ and $\mathbf{y}_2 - \bar{x}_2\mathbf{1}$.

(c) Sketch the deviation vectors in (b) emanating from the origin. Calculate their lengths
and the cosine of the angle between them. Relate these quantities to $\mathbf{S}_n$ and $\mathbf{R}$.

3.3. Perform the decomposition of $\mathbf{y}_1$ into $\bar{x}_1\mathbf{1}$ and $\mathbf{y}_1 - \bar{x}_1\mathbf{1}$ using the first column of the data
matrix in Example 3.9.

3.4. Use the six observations on the variable $X_1$, in units of millions, from Table 1.1.

(a) Find the projection on $\mathbf{1}' = [1, 1, 1, 1, 1, 1]$.

(b) Calculate the deviation vector $\mathbf{y}_1 - \bar{x}_1\mathbf{1}$. Relate its length to the sample standard
deviation.


(c) Graph (to scale) the triangle formed by $\mathbf{y}_1$, $\bar{x}_1\mathbf{1}$, and $\mathbf{y}_1 - \bar{x}_1\mathbf{1}$. Identify the length of
each component in your graph.

(d) Repeat Parts a–c for the variable $X_2$ in Table 1.1.

(e) Graph (to scale) the two deviation vectors $\mathbf{y}_1 - \bar{x}_1\mathbf{1}$ and $\mathbf{y}_2 - \bar{x}_2\mathbf{1}$. Calculate the
value of the angle between them.

3.5. Calculate the generalized sample variance $|\mathbf{S}|$ for (a) the data matrix $\mathbf{X}$ in Exercise 3.1
and (b) the data matrix $\mathbf{X}$ in Exercise 3.2.

3.6. Consider the data matrix

$$\mathbf{X} = \begin{bmatrix} -1 & 3 & -2 \\ 2 & 4 & 2 \\ 5 & 2 & 3 \end{bmatrix}$$

(a) Calculate the matrix of deviations (residuals), $\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}'$. Is this matrix of full rank?
Explain.

(b) Determine $\mathbf{S}$ and calculate the generalized sample variance $|\mathbf{S}|$. Interpret the latter
geometrically.

(c) Using the results in (b), calculate the total sample variance. [See (3-23).]

3.7. Sketch the solid ellipsoids $(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le 1$ [see (3-16)] for the three matrices

$$\mathbf{S} = [\;\ldots\;], \qquad \mathbf{S} = [\;\ldots\;], \qquad \mathbf{S} = [\;\ldots\;]$$

(Note that these matrices have the same generalized variance $|\mathbf{S}|$.)

3.8. Given

$$\mathbf{S} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad \mathbf{S} = \begin{bmatrix} 1 & -\tfrac{1}{2} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 & -\tfrac{1}{2} \\ -\tfrac{1}{2} & -\tfrac{1}{2} & 1 \end{bmatrix}$$

(a) Calculate the total sample variance for each $\mathbf{S}$. Compare the results.

(b) Calculate the generalized sample variance for each $\mathbf{S}$, and compare the results.
Comment on the discrepancies, if any, found between Parts a and b.

3.9. The following data matrix contains data on test scores, with $x_1$ = score on first test,
$x_2$ = score on second test, and $x_3$ = total score on the two tests:

$$\mathbf{X} = \begin{bmatrix} 12 & 17 & 29 \\ 18 & 20 & 38 \\ 14 & 16 & 30 \\ 20 & 18 & 38 \\ 16 & 19 & 35 \end{bmatrix}$$

(a) Obtain the mean corrected data matrix, and verify that the columns are linearly
dependent. Specify an $\mathbf{a}' = [a_1, a_2, a_3]$ vector that establishes the linear dependence.

(b) Obtain the sample covariance matrix $\mathbf{S}$, and verify that the generalized variance is
zero. Also, show that $\mathbf{S}\mathbf{a} = \mathbf{0}$, so $\mathbf{a}$ can be rescaled to be an eigenvector
corresponding to eigenvalue zero.

(c) Verify that the third column of the data matrix is the sum of the first two columns.
That is, show that there is linear dependence, with $a_1 = 1$, $a_2 = 1$, and $a_3 = -1$.




3.10. When the generalized variance is zero, it is the columns of the mean corrected data
matrix $\mathbf{X}_c = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}'$ that are linearly dependent, not necessarily those of the data
matrix itself. Given the data

$$[\;\ldots\;]$$

(a) Obtain the mean corrected data matrix, and verify that the columns are linearly
dependent. Specify an $\mathbf{a}' = [a_1, a_2, a_3]$ vector that establishes the dependence.

(b) Obtain the sample covariance matrix $\mathbf{S}$, and verify that the generalized variance is
zero.

(c) Show that the columns of the data matrix are linearly independent in this case.


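3.11. Use the sample covariance obtained in Example 3.7 to verify (3-29) and (3-30), which
state that $\mathbf{R} = \mathbf{D}^{-1/2}\mathbf{S}\mathbf{D}^{-1/2}$ and $\mathbf{D}^{1/2}\mathbf{R}\mathbf{D}^{1/2} = \mathbf{S}$.

3.12. Show that $|\mathbf{S}| = (s_{11}s_{22}\cdots s_{pp})|\mathbf{R}|$.
Hint: From Equation (3-30), $\mathbf{S} = \mathbf{D}^{1/2}\mathbf{R}\mathbf{D}^{1/2}$. Taking determinants gives
$|\mathbf{S}| = |\mathbf{D}^{1/2}|\,|\mathbf{R}|\,|\mathbf{D}^{1/2}|$. (See Result 2A.11.) Now examine $|\mathbf{D}^{1/2}|$.

3.13. Given a data matrix $\mathbf{X}$ and the resulting sample correlation matrix $\mathbf{R}$,
consider the standardized observations $(x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$, $k = 1, 2, \ldots, p$,
$j = 1, 2, \ldots, n$. Show that these standardized quantities have sample covariance
matrix $\mathbf{R}$.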


3.14. Consider the data matrix $\mathbf{X}$ in Exercise 3.1. We have n = 3 observations on p = 2
variables $X_1$ and $X_2$. Form the linear combinations

$$\mathbf{b}'\mathbf{X} = \begin{bmatrix} 2 & 3 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = 2X_1 + 3X_2$$

(a) Evaluate the sample means, variances, and covariance of $\mathbf{b}'\mathbf{X}$ and $\mathbf{c}'\mathbf{X}$ from first
principles. That is, calculate the observed values of $\mathbf{b}'\mathbf{X}$ and $\mathbf{c}'\mathbf{X}$, and then use the
sample mean, variance, and covariance formulas.

(b) Calculate the sample means, variances, and covariance of $\mathbf{b}'\mathbf{X}$ and $\mathbf{c}'\mathbf{X}$ using (3-36).
Compare the results in (a) and (b).


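3.15. Repeat Exercise 3.14 using the data matrix

$$\mathbf{X} = \begin{bmatrix} 1 & 4 & 3 \\ 6 & 2 & 6 \\ 8 & 3 & 3 \end{bmatrix}$$

and the linear combinations

$$\mathbf{b}'\mathbf{X} = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix}$$

and

$$\mathbf{c}'\mathbf{X} = \begin{bmatrix} 1 & 2 & -3 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix}$$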

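3.16. Let $\mathbf{V}$ be a vector random variable with mean vector $E(\mathbf{V}) = \boldsymbol{\mu}_V$ and covariance matrix
$E(\mathbf{V} - \boldsymbol{\mu}_V)(\mathbf{V} - \boldsymbol{\mu}_V)' = \boldsymbol{\Sigma}_V$. Show that $E(\mathbf{V}\mathbf{V}') = \boldsymbol{\Sigma}_V + \boldsymbol{\mu}_V\boldsymbol{\mu}_V'$.

3.17. Show that, if $\mathbf{X}$ $(p \times 1)$ and $\mathbf{Z}$ $(q \times 1)$ are independent, then each component of $\mathbf{X}$ is
independent of each component of $\mathbf{Z}$.

Hint: $P[X_1 \le x_1, X_2 \le x_2, \ldots, X_p \le x_p \text{ and } Z_1 \le z_1, \ldots, Z_q \le z_q]$
$= P[X_1 \le x_1, X_2 \le x_2, \ldots, X_p \le x_p] \cdot P[Z_1 \le z_1, \ldots, Z_q \le z_q]$
by independence. Let $x_2, \ldots, x_p$ and $z_2, \ldots, z_q$ tend to infinity, to obtain

$$P[X_1 \le x_1 \text{ and } Z_1 \le z_1] = P[X_1 \le x_1] \cdot P[Z_1 \le z_1]$$

for all $x_1$, $z_1$. So $X_1$ and $Z_1$ are independent. Repeat for other pairs.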
3.18. Energy consumption in 2001, by state, from the major sources

$x_1$ = petroleum, $x_2$ = natural gas,
$x_3$ = hydroelectric power, $x_4$ = nuclear electric power

is recorded in quadrillions ($10^{15}$) of BTUs (Source: Statistical Abstract of the United
States 2006).
The resulting mean and covariance matrix are

$$\bar{\mathbf{x}} = \begin{bmatrix} 0.766 \\ 0.508 \\ 0.438 \\ 0.161 \end{bmatrix}, \qquad \mathbf{S} = \begin{bmatrix} 0.856 & 0.635 & 0.173 & 0.096 \\ 0.635 & 0.568 & 0.128 & 0.067 \\ 0.173 & 0.127 & 0.171 & 0.039 \\ 0.096 & 0.067 & 0.039 & 0.043 \end{bmatrix}$$


(a) Using the summary statistics, determine the sample mean and variance of a state’s 
total energy consumption for these major sources. 


(b) Determine the sample mean and variance of the excess of petroleum consumption 
over natural gas consumption. Also find the sample covariance of this variable with 
the total variable in part a. 


3.19. Using the summary statistics for the first three variables in Exercise 3.18, verify the 
relation 


$$|\mathbf{S}| = (s_{11}s_{22}s_{33})|\mathbf{R}|$$




3.20. In northern climates, roads must be cleared of snow quickly following a storm. One
measure of storm severity is $x_1$ = its duration in hours, while the effectiveness of snow
removal can be quantified by $x_2$ = the number of hours crews, men, and machine, spend
to clear snow. Here are the results for 25 incidents in Wisconsin.


Table 3.2 Snow Data


xy 
12.5 13.7 9.0 24.4 35 26.1 


(a) Find the sample mean and variance of the difference $x_2 - x_1$ by first obtaining the
summary statistics.

(b) Obtain the mean and variance by first obtaining the individual values $x_{j2} - x_{j1}$,
for $j = 1, 2, \ldots, 25$ and then calculating the mean and variance. Compare these values
with those obtained in part a.


References


1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York:
John Wiley, 2003.

2. Eaton, M., and M. Perlman. “The Non-Singularity of Generalized Sample Covariance
Matrices.” Annals of Statistics, 1 (1973), 710-717.


Chapter 4


THE MULTIVARIATE NORMAL DISTRIBUTION


4.1 Introduction 


A generalization of the familiar bell-shaped normal density to several dimensions plays
a fundamental role in multivariate analysis. In fact, most of the techniques encountered
in this book are based on the assumption that the data were generated from a
multivariate normal distribution. While real data are never exactly multivariate normal, the
normal density is often a useful approximation to the “true” population distribution.

One advantage of the multivariate normal distribution stems from the fact that
it is mathematically tractable and “nice” results can be obtained. This is frequently
not the case for other data-generating distributions. Of course, mathematical
attractiveness per se is of little use to the practitioner. It turns out, however, that normal
distributions are useful in practice for two reasons: First, the normal distribution
serves as a bona fide population model in some instances; second, the sampling
distributions of many multivariate statistics are approximately normal, regardless of
the form of the parent population, because of a central limit effect.

To summarize, many real-world problems fall naturally within the framework of 
normal theory. The importance of the normal distribution rests on its dual role as 
both population model for certain natural phenomena and approximate sampling 
distribution for many statistics. 


4.2 The Multivariate Normal Density and Its Properties 


The multivariate normal density is a generalization of the univariate normal density
to $p \ge 2$ dimensions. Recall that the univariate normal distribution, with mean $\mu$
and variance $\sigma^2$, has the probability density function

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x-\mu)/\sigma]^2/2}, \qquad -\infty < x < \infty \tag{4-1}$$




Figure 4.1 A normal density with mean $\mu$ and variance $\sigma^2$ and selected areas under the
curve.


A plot of this function yields the familiar bell-shaped curve shown in Figure 4.1.
Also shown in the figure are approximate areas under the curve within ±1 standard
deviations and ±2 standard deviations of the mean. These areas represent
probabilities, and thus, for the normal random variable $X$,

$$P(\mu - \sigma \le X \le \mu + \sigma) \approx .68$$
$$P(\mu - 2\sigma \le X \le \mu + 2\sigma) \approx .95$$

It is convenient to denote the normal density function with mean $\mu$ and
variance $\sigma^2$ by $N(\mu, \sigma^2)$. Therefore, $N(10, 4)$ refers to the function in (4-1) with $\mu = 10$
and $\sigma = 2$. This notation will be extended to the multivariate case later.


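The term

$$\left(\frac{x - \mu}{\sigma}\right)^2 = (x - \mu)(\sigma^2)^{-1}(x - \mu) \tag{4-2}$$

in the exponent of the univariate normal density function measures the square of
the distance from $x$ to $\mu$ in standard deviation units. This can be generalized for a
$p \times 1$ vector $\mathbf{x}$ of observations on several variables as

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) \tag{4-3}$$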


The $p \times 1$ vector $\boldsymbol{\mu}$ represents the expected value of the random vector $\mathbf{X}$, and the
$p \times p$ matrix $\boldsymbol{\Sigma}$ is the variance-covariance matrix of $\mathbf{X}$. [See (2-30) and (2-31).] We
shall assume that the symmetric matrix $\boldsymbol{\Sigma}$ is positive definite, so the expression in
(4-3) is the square of the generalized distance from $\mathbf{x}$ to $\boldsymbol{\mu}$.

The multivariate normal density is obtained by replacing the univariate distance
in (4-2) by the multivariate generalized distance of (4-3) in the density function of
(4-1). When this replacement is made, the univariate normalizing constant
$(2\pi)^{-1/2}(\sigma^2)^{-1/2}$ must be changed to a more general constant that makes the volume
under the surface of the multivariate density function unity for any $p$. This is
necessary because, in the multivariate case, probabilities are represented by volumes
under the surface over regions defined by intervals of the $x_i$ values. It can be shown
(see [1]) that this constant is $(2\pi)^{-p/2}|\boldsymbol{\Sigma}|^{-1/2}$, and consequently, a $p$-dimensional
normal density for the random vector $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$ has the form

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}}\, e^{-(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})/2} \tag{4-4}$$

where $-\infty < x_i < \infty$, $i = 1, 2, \ldots, p$. We shall denote this $p$-dimensional normal
density by $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, which is analogous to the normal density in the univariate
case.
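The density in (4-4) translates almost line for line into code. The sketch below assumes NumPy; the function name `mvn_density` and the numerical inputs are illustrative choices, not part of the text. It solves a linear system rather than forming the inverse of the covariance matrix explicitly.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Evaluate the p-dimensional normal density (4-4) at the point x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    p = mu.shape[0]
    dev = x - mu
    # Squared generalized distance (4-3): (x - mu)' Sigma^{-1} (x - mu).
    d2 = dev @ np.linalg.solve(Sigma, dev)
    # Normalizing constant (2 pi)^{-p/2} |Sigma|^{-1/2}.
    const = (2.0 * np.pi) ** (-p / 2.0) * np.linalg.det(Sigma) ** (-0.5)
    return const * np.exp(-0.5 * d2)

# With p = 1 the formula reduces to the univariate density (4-1).
value_p1 = mvn_density([1.0], [0.0], [[4.0]])
univariate = np.exp(-((1.0 - 0.0) / 2.0) ** 2 / 2.0) / np.sqrt(2.0 * np.pi * 4.0)
print(value_p1, univariate)

# At the mean of a standard bivariate normal the height is 1 / (2 pi).
print(mvn_density([0.0, 0.0], [0.0, 0.0], np.eye(2)))
```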




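Example 4.1 (Bivariate normal density) Let us evaluate the p = 2-variate normal
density in terms of the individual parameters $\mu_1 = E(X_1)$, $\mu_2 = E(X_2)$,
$\sigma_{11} = \mathrm{Var}(X_1)$, $\sigma_{22} = \mathrm{Var}(X_2)$, and $\rho_{12} = \sigma_{12}/(\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}) = \mathrm{Corr}(X_1, X_2)$.
Using Result 2A.8, we find that the inverse of the covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}$$

is

$$\boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma_{11}\sigma_{22} - \sigma_{12}^2}\begin{bmatrix} \sigma_{22} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{11} \end{bmatrix}$$

Introducing the correlation coefficient $\rho_{12}$ by writing $\sigma_{12} = \rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}$, we
obtain $\sigma_{11}\sigma_{22} - \sigma_{12}^2 = \sigma_{11}\sigma_{22}(1 - \rho_{12}^2)$, and the squared distance becomes

$$\begin{aligned}
(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})
&= [x_1 - \mu_1,\; x_2 - \mu_2]\,\frac{1}{\sigma_{11}\sigma_{22}(1 - \rho_{12}^2)}\begin{bmatrix} \sigma_{22} & -\rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}} \\ -\rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}} & \sigma_{11} \end{bmatrix}\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \\
&= \frac{\sigma_{22}(x_1 - \mu_1)^2 + \sigma_{11}(x_2 - \mu_2)^2 - 2\rho_{12}\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}\,(x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_{11}\sigma_{22}(1 - \rho_{12}^2)} \\
&= \frac{1}{1 - \rho_{12}^2}\left[\left(\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\right)^2 + \left(\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right)^2 - 2\rho_{12}\left(\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\right)\left(\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right)\right]
\end{aligned} \tag{4-5}$$

The last expression is written in terms of the standardized values $(x_1 - \mu_1)/\sqrt{\sigma_{11}}$ and
$(x_2 - \mu_2)/\sqrt{\sigma_{22}}$.

Next, since $|\boldsymbol{\Sigma}| = \sigma_{11}\sigma_{22} - \sigma_{12}^2 = \sigma_{11}\sigma_{22}(1 - \rho_{12}^2)$, we can substitute for $\boldsymbol{\Sigma}^{-1}$
and $|\boldsymbol{\Sigma}|$ in (4-4) to get the expression for the bivariate (p = 2) normal density
involving the individual parameters $\mu_1$, $\mu_2$, $\sigma_{11}$, $\sigma_{22}$, and $\rho_{12}$:

$$\begin{aligned}
f(x_1, x_2) = {} & \frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}(1 - \rho_{12}^2)}} \\
& \times \exp\left\{-\frac{1}{2(1 - \rho_{12}^2)}\left[\left(\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\right)^2 + \left(\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right)^2 - 2\rho_{12}\left(\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\right)\left(\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right)\right]\right\}
\end{aligned} \tag{4-6}$$

The expression in (4-6) is somewhat unwieldy, and the compact general form in
(4-4) is more informative in many ways. On the other hand, the expression in (4-6) is
useful for discussing certain properties of the normal distribution. For example, if the
random variables $X_1$ and $X_2$ are uncorrelated, so that $\rho_{12} = 0$, the joint density can
be written as the product of two univariate normal densities each of the form of (4-1).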




That is, $f(x_1, x_2) = f(x_1)f(x_2)$ and $X_1$ and $X_2$ are independent. [See (2-28).] This
result is true in general. (See Result 4.5.)

Two bivariate distributions with $\sigma_{11} = \sigma_{22}$ are shown in Figure 4.2. In Figure
4.2(a), $X_1$ and $X_2$ are independent ($\rho_{12} = 0$). In Figure 4.2(b), $\rho_{12} = .75$. Notice how
the presence of correlation causes the probability to concentrate along a line. ■
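The factorization of the bivariate density when the correlation is zero can be checked numerically. The sketch below uses only the Python standard library; the function names and the parameter values are illustrative assumptions, not taken from the text.

```python
import math

def bivariate_normal_density(x1, x2, mu1, mu2, s11, s22, rho):
    """Bivariate normal density written in terms of the individual parameters, as in (4-6)."""
    z1 = (x1 - mu1) / math.sqrt(s11)   # standardized value of x1
    z2 = (x2 - mu2) / math.sqrt(s22)   # standardized value of x2
    quad = (z1 ** 2 + z2 ** 2 - 2.0 * rho * z1 * z2) / (1.0 - rho ** 2)
    const = 2.0 * math.pi * math.sqrt(s11 * s22 * (1.0 - rho ** 2))
    return math.exp(-quad / 2.0) / const

def univariate_normal_density(x, mu, var):
    """Univariate normal density (4-1) with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Arbitrary illustrative parameters and evaluation point.
mu1, mu2, s11, s22 = 1.0, -0.5, 2.0, 0.8
x1, x2 = 0.3, 0.1

joint_uncorrelated = bivariate_normal_density(x1, x2, mu1, mu2, s11, s22, rho=0.0)
product_of_marginals = (univariate_normal_density(x1, mu1, s11)
                        * univariate_normal_density(x2, mu2, s22))
joint_correlated = bivariate_normal_density(x1, x2, mu1, mu2, s11, s22, rho=0.75)

print(joint_uncorrelated, product_of_marginals, joint_correlated)
```

With `rho=0.0` the joint density equals the product of the two marginal densities; with a nonzero correlation it generally does not.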


Figure 4.2 Two bivariate normal distributions. (a) $\sigma_{11} = \sigma_{22}$ and $\rho_{12} = 0$.
(b) $\sigma_{11} = \sigma_{22}$ and $\rho_{12} = .75$.




From the expression in (4-4) for the density of a $p$-dimensional normal variable, it
should be clear that the paths of $\mathbf{x}$ values yielding a constant height for the density are
ellipsoids. That is, the multivariate normal density is constant on surfaces where the
square of the distance $(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})$ is constant. These paths are called contours:

Constant probability density contour = {all $\mathbf{x}$ such that $(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = c^2$}
= surface of an ellipsoid centered at $\boldsymbol{\mu}$

The axes of each ellipsoid of constant density are in the direction of the
eigenvectors of $\boldsymbol{\Sigma}^{-1}$, and their lengths are proportional to the reciprocals of the square
roots of the eigenvalues of $\boldsymbol{\Sigma}^{-1}$. Fortunately, we can avoid the calculation of $\boldsymbol{\Sigma}^{-1}$ when
determining the axes, since these ellipsoids are also determined by the eigenvalues
and eigenvectors of $\boldsymbol{\Sigma}$. We state the correspondence formally for later reference.


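Result 4.1. If $\boldsymbol{\Sigma}$ is positive definite, so that $\boldsymbol{\Sigma}^{-1}$ exists, then

$$\boldsymbol{\Sigma}\mathbf{e} = \lambda\mathbf{e} \quad \text{implies} \quad \boldsymbol{\Sigma}^{-1}\mathbf{e} = \left(\frac{1}{\lambda}\right)\mathbf{e}$$

so $(\lambda, \mathbf{e})$ is an eigenvalue-eigenvector pair for $\boldsymbol{\Sigma}$ corresponding to the pair $(1/\lambda, \mathbf{e})$
for $\boldsymbol{\Sigma}^{-1}$. Also, $\boldsymbol{\Sigma}^{-1}$ is positive definite.


Proof. For $\boldsymbol{\Sigma}$ positive definite and $\mathbf{e} \ne \mathbf{0}$ an eigenvector, we have
$0 < \mathbf{e}'\boldsymbol{\Sigma}\mathbf{e} = \mathbf{e}'(\boldsymbol{\Sigma}\mathbf{e}) = \mathbf{e}'(\lambda\mathbf{e}) = \lambda\mathbf{e}'\mathbf{e} = \lambda$.
Moreover, $\mathbf{e} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\Sigma}\mathbf{e}) = \boldsymbol{\Sigma}^{-1}(\lambda\mathbf{e})$, or $\mathbf{e} = \lambda\boldsymbol{\Sigma}^{-1}\mathbf{e}$, and
division by $\lambda > 0$ gives $\boldsymbol{\Sigma}^{-1}\mathbf{e} = (1/\lambda)\mathbf{e}$. Thus, $(1/\lambda, \mathbf{e})$ is an eigenvalue-eigenvector pair
for $\boldsymbol{\Sigma}^{-1}$. Also, for any $p \times 1$ $\mathbf{x}$, by (2-21)

$$\mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} = \mathbf{x}'\left(\sum_{i=1}^{p}\frac{1}{\lambda_i}\mathbf{e}_i\mathbf{e}_i'\right)\mathbf{x} = \sum_{i=1}^{p}\frac{1}{\lambda_i}(\mathbf{x}'\mathbf{e}_i)^2 \ge 0$$

since each term $\lambda_i^{-1}(\mathbf{x}'\mathbf{e}_i)^2$ is nonnegative. In addition, $\mathbf{x}'\mathbf{e}_i = 0$ for all $i$ only if
$\mathbf{x} = \mathbf{0}$. So $\mathbf{x} \ne \mathbf{0}$ implies that $\sum_{i=1}^{p}(1/\lambda_i)(\mathbf{x}'\mathbf{e}_i)^2 > 0$, and it follows that $\boldsymbol{\Sigma}^{-1}$ is
positive definite. ■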


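The following summarizes these concepts:

Contours of constant density for the $p$-dimensional normal distribution are
ellipsoids defined by $\mathbf{x}$ such that

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = c^2 \tag{4-7}$$

These ellipsoids are centered at $\boldsymbol{\mu}$ and have axes $\pm c\sqrt{\lambda_i}\,\mathbf{e}_i$, where $\boldsymbol{\Sigma}\mathbf{e}_i = \lambda_i\mathbf{e}_i$
for $i = 1, 2, \ldots, p$.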
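The axes of a constant-density contour, and the eigenvalue correspondence of Result 4.1, can be computed from the eigendecomposition of the covariance matrix alone. The following is a sketch assuming NumPy; the helper name `contour_axes` and the 2 × 2 covariance matrix are hypothetical.

```python
import numpy as np

def contour_axes(Sigma, c):
    """Return eigenvalues, eigenvectors, and half-axis vectors c*sqrt(lambda_i)*e_i."""
    lam, E = np.linalg.eigh(Sigma)      # columns of E are unit-length eigenvectors
    half_axes = c * np.sqrt(lam) * E    # broadcasting scales column i by c*sqrt(lam[i])
    return lam, E, half_axes

# Hypothetical positive definite covariance matrix and contour constant.
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
c = 1.5
lam, E, half_axes = contour_axes(Sigma, c)

# Each half-axis endpoint (measured from the center) lies on the contour (4-7).
Sigma_inv = np.linalg.inv(Sigma)
on_contour = [v @ Sigma_inv @ v for v in half_axes.T]
print(on_contour, c ** 2)

# Result 4.1: the eigenvalues of the inverse are the reciprocals 1/lambda_i.
print(np.sort(1.0 / lam), np.linalg.eigvalsh(Sigma_inv))
```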


A contour of constant density for a bivariate normal distribution with
$\sigma_{11} = \sigma_{22}$ is obtained in the following example.




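Example 4.2 (Contours of the bivariate normal density) We shall obtain the axes of
constant probability density contours for a bivariate normal distribution when
$\sigma_{11} = \sigma_{22}$. From (4-7), these axes are given by the eigenvalues and eigenvectors of
$\boldsymbol{\Sigma}$. Here $|\boldsymbol{\Sigma} - \lambda\mathbf{I}| = 0$ becomes

$$0 = \begin{vmatrix} \sigma_{11} - \lambda & \sigma_{12} \\ \sigma_{12} & \sigma_{11} - \lambda \end{vmatrix} = (\sigma_{11} - \lambda)^2 - \sigma_{12}^2 = (\lambda - \sigma_{11} - \sigma_{12})(\lambda - \sigma_{11} + \sigma_{12})$$

Consequently, the eigenvalues are $\lambda_1 = \sigma_{11} + \sigma_{12}$ and $\lambda_2 = \sigma_{11} - \sigma_{12}$. The
eigenvector $\mathbf{e}_1$ is determined from

$$\begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{11} \end{bmatrix}\begin{bmatrix} e_1 \\ e_2 \end{bmatrix} = (\sigma_{11} + \sigma_{12})\begin{bmatrix} e_1 \\ e_2 \end{bmatrix}$$

or

$$\begin{aligned}
\sigma_{11}e_1 + \sigma_{12}e_2 &= (\sigma_{11} + \sigma_{12})e_1 \\
\sigma_{12}e_1 + \sigma_{11}e_2 &= (\sigma_{11} + \sigma_{12})e_2
\end{aligned}$$

These equations imply that $e_1 = e_2$, and after normalization, the first
eigenvalue-eigenvector pair is

$$\lambda_1 = \sigma_{11} + \sigma_{12}, \qquad \mathbf{e}_1 = \begin{bmatrix} \dfrac{1}{\sqrt{2}} \\[2mm] \dfrac{1}{\sqrt{2}} \end{bmatrix}$$

Similarly, $\lambda_2 = \sigma_{11} - \sigma_{12}$ yields the eigenvector $\mathbf{e}_2' = [1/\sqrt{2}, -1/\sqrt{2}]$.

When the covariance $\sigma_{12}$ (or correlation $\rho_{12}$) is positive, $\lambda_1 = \sigma_{11} + \sigma_{12}$ is the
largest eigenvalue, and its associated eigenvector $\mathbf{e}_1' = [1/\sqrt{2}, 1/\sqrt{2}]$ lies along
the 45° line through the point $\boldsymbol{\mu}' = [\mu_1, \mu_2]$. This is true for any positive value of
the covariance (correlation). Since the axes of the constant-density ellipses are
given by $\pm c\sqrt{\lambda_1}\,\mathbf{e}_1$ and $\pm c\sqrt{\lambda_2}\,\mathbf{e}_2$ [see (4-7)], and the eigenvectors each have
length unity, the major axis will be associated with the largest eigenvalue. For
positively correlated normal random variables, then, the major axis of the
constant-density ellipses will be along the 45° line through $\boldsymbol{\mu}$. (See Figure 4.3.)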


Figure 4.3 A constant-density contour for a bivariate normal
distribution with $\sigma_{11} = \sigma_{22}$ and $\sigma_{12} > 0$ (or $\rho_{12} > 0$).




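When the covariance (correlation) is negative, $\lambda_2 = \sigma_{11} - \sigma_{12}$ will be the largest
eigenvalue, and the major axes of the constant-density ellipses will lie along a line
at right angles to the 45° line through $\boldsymbol{\mu}$. (These results are true only for
$\sigma_{11} = \sigma_{22}$.)

To summarize, the axes of the ellipses of constant density for a bivariate normal
distribution with $\sigma_{11} = \sigma_{22}$ are determined by

$$\pm c\sqrt{\sigma_{11} + \sigma_{12}}\begin{bmatrix} \dfrac{1}{\sqrt{2}} \\[2mm] \dfrac{1}{\sqrt{2}} \end{bmatrix} \quad \text{and} \quad \pm c\sqrt{\sigma_{11} - \sigma_{12}}\begin{bmatrix} \dfrac{1}{\sqrt{2}} \\[2mm] -\dfrac{1}{\sqrt{2}} \end{bmatrix}$$

■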


We show in Result 4.7 that the choice $c^2 = \chi_p^2(\alpha)$, where $\chi_p^2(\alpha)$ is the upper
$(100\alpha)$th percentile of a chi-square distribution with $p$ degrees of freedom, leads to
contours that contain $(1 - \alpha) \times 100\%$ of the probability. Specifically, the following
is true for a $p$-dimensional normal distribution:

The solid ellipsoid of $\mathbf{x}$ values satisfying

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) \le \chi_p^2(\alpha) \tag{4-8}$$

has probability $1 - \alpha$.
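The probability content in (4-8) can be illustrated by simulation. The sketch below assumes NumPy and takes p = 2, where the upper percentile has the closed form $\chi_2^2(\alpha) = -2\ln\alpha$ (a chi-square variable with 2 degrees of freedom is exponential with mean 2). The mean vector, covariance matrix, seed, and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bivariate normal parameters.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 1.5],
                  [1.5, 2.0]])

alpha = 0.10
c2 = -2.0 * np.log(alpha)   # upper (100*alpha)th percentile of chi-square with 2 d.f.

# Draw a large sample and compute each squared generalized distance from mu.
draws = rng.multivariate_normal(mu, Sigma, size=200_000)
dev = draws - mu
d2 = np.einsum("ij,jk,ik->i", dev, np.linalg.inv(Sigma), dev)

# Fraction of draws inside the solid ellipsoid (4-8); it should be near 1 - alpha.
coverage = np.mean(d2 <= c2)
print(coverage)
```

The simulated coverage should be close to 0.90, up to Monte Carlo error.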


The constant-density contours containing 50% and 90% of the probability under 
the bivariate normal surfaces in Figure 4.2 are pictured in Figure 4.4. 


Figure 4.4 The 50% and 90% contours for the bivariate normal
distributions in Figure 4.2.


The $p$-variate normal density in (4-4) has a maximum value when the squared
distance in (4-3) is zero, that is, when $\mathbf{x} = \boldsymbol{\mu}$. Thus, $\boldsymbol{\mu}$ is the point of maximum
density, or mode, as well as the expected value of $\mathbf{X}$, or mean. The fact that $\boldsymbol{\mu}$ is
the mean of the multivariate normal distribution follows from the symmetry
exhibited by the constant-density contours: These contours are centered, or balanced,
at $\boldsymbol{\mu}$.




Additional Properties of the Multivariate 
Normal Distribution 


Certain properties of the normal distribution will be needed repeatedly in our
explanations of statistical models and methods. These properties make it possible
to manipulate normal distributions easily and, as we suggested in Section 4.1, are
partly responsible for the popularity of the normal distribution. The key
properties, which we shall soon discuss in some mathematical detail, can be stated rather
simply.

The following are true for a random vector $\mathbf{X}$ having a multivariate normal
distribution:

1. Linear combinations of the components of $\mathbf{X}$ are normally distributed.

2. All subsets of the components of $\mathbf{X}$ have a (multivariate) normal distribution.

3. Zero covariance implies that the corresponding components are independently
distributed.

4. The conditional distributions of the components are (multivariate) normal.


These statements are reproduced mathematically in the results that follow. Many
of these results are illustrated with examples. The proofs that are included should
help improve your understanding of matrix manipulations and also lead you
to an appreciation for the manner in which the results successively build on
themselves.

Result 4.2 can be taken as a working definition of the normal distribution. With
this in hand, the subsequent properties are almost immediate. Our partial proof of
Result 4.2 indicates how the linear combination definition of a normal density
relates to the multivariate density in (4-4).


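Result 4.2. If $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then any linear combination of
variables $\mathbf{a}'\mathbf{X} = a_1X_1 + a_2X_2 + \cdots + a_pX_p$ is distributed as $N(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})$. Also, if $\mathbf{a}'\mathbf{X}$
is distributed as $N(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})$ for every $\mathbf{a}$, then $\mathbf{X}$ must be $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.


Proof. The expected value and variance of $\mathbf{a}'\mathbf{X}$ follow from (2-43). Proving that
$\mathbf{a}'\mathbf{X}$ is normally distributed if $\mathbf{X}$ is multivariate normal is more difficult. You can find
a proof in [1]. The second part of Result 4.2 is also demonstrated in [1]. ■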


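Example 4.3 (The distribution of a linear combination of the components of a normal
random vector) Consider the linear combination $\mathbf{a}'\mathbf{X}$ of a multivariate normal
random vector determined by the choice $\mathbf{a}' = [1, 0, \ldots, 0]$. Since

$$\mathbf{a}'\mathbf{X} = [1, 0, \ldots, 0]\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix} = X_1$$

and

$$\mathbf{a}'\boldsymbol{\mu} = [1, 0, \ldots, 0]\begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix} = \mu_1$$

we have

$$\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a} = [1, 0, \ldots, 0]\begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{12} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1p} & \sigma_{2p} & \cdots & \sigma_{pp} \end{bmatrix}\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \sigma_{11}$$

and it follows from Result 4.2 that $X_1$ is distributed as $N(\mu_1, \sigma_{11})$. More generally,
the marginal distribution of any component $X_i$ of $\mathbf{X}$ is $N(\mu_i, \sigma_{ii})$. ■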


The next result considers several linear combinations of a multivariate normal 
vector X. 


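Result 4.3. If $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the $q$ linear combinations

$$\underset{(q \times p)}{\mathbf{A}}\;\underset{(p \times 1)}{\mathbf{X}} = \begin{bmatrix} a_{11}X_1 + \cdots + a_{1p}X_p \\ a_{21}X_1 + \cdots + a_{2p}X_p \\ \vdots \\ a_{q1}X_1 + \cdots + a_{qp}X_p \end{bmatrix}$$

are distributed as $N_q(\mathbf{A}\boldsymbol{\mu}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$. Also, $\underset{(p \times 1)}{\mathbf{X}} + \underset{(p \times 1)}{\mathbf{d}}$, where $\mathbf{d}$ is a vector of
constants, is distributed as $N_p(\boldsymbol{\mu} + \mathbf{d}, \boldsymbol{\Sigma})$.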


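Proof. The expected value $E(\mathbf{A}\mathbf{X})$ and the covariance matrix of $\mathbf{A}\mathbf{X}$ follow from
(2-45). Any linear combination $\mathbf{b}'(\mathbf{A}\mathbf{X})$ is a linear combination of $\mathbf{X}$, of the
form $\mathbf{a}'\mathbf{X}$ with $\mathbf{a} = \mathbf{A}'\mathbf{b}$. Thus, the conclusion concerning $\mathbf{A}\mathbf{X}$ follows directly from
Result 4.2.

The second part of the result can be obtained by considering $\mathbf{a}'(\mathbf{X} + \mathbf{d}) =
\mathbf{a}'\mathbf{X} + (\mathbf{a}'\mathbf{d})$, where $\mathbf{a}'\mathbf{X}$ is distributed as $N(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})$. It is known from the
univariate case that adding a constant $\mathbf{a}'\mathbf{d}$ to the random variable $\mathbf{a}'\mathbf{X}$ leaves the
variance unchanged and translates the mean to $\mathbf{a}'\boldsymbol{\mu} + \mathbf{a}'\mathbf{d} = \mathbf{a}'(\boldsymbol{\mu} + \mathbf{d})$. Since $\mathbf{a}$
was arbitrary, $\mathbf{X} + \mathbf{d}$ is distributed as $N_p(\boldsymbol{\mu} + \mathbf{d}, \boldsymbol{\Sigma})$. ■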


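Example 4.4 (The distribution of two linear combinations of the components of a
normal random vector) For $\mathbf{X}$ distributed as $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, find the distribution of

$$\begin{bmatrix} X_1 - X_2 \\ X_2 - X_3 \end{bmatrix} = \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = \mathbf{A}\mathbf{X}$$

By Result 4.3, the distribution of $\mathbf{A}\mathbf{X}$ is multivariate normal with mean

$$\mathbf{A}\boldsymbol{\mu} = \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix}\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix} = \begin{bmatrix} \mu_1 - \mu_2 \\ \mu_2 - \mu_3 \end{bmatrix}$$

and covariance matrix

$$\begin{aligned}
\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'
&= \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix}\begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_{22} & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_{33} \end{bmatrix}\begin{bmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{bmatrix} \\
&= \begin{bmatrix} \sigma_{11} - \sigma_{12} & \sigma_{12} - \sigma_{22} & \sigma_{13} - \sigma_{23} \\ \sigma_{12} - \sigma_{13} & \sigma_{22} - \sigma_{23} & \sigma_{23} - \sigma_{33} \end{bmatrix}\begin{bmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{bmatrix} \\
&= \begin{bmatrix} \sigma_{11} - 2\sigma_{12} + \sigma_{22} & \sigma_{12} + \sigma_{23} - \sigma_{22} - \sigma_{13} \\ \sigma_{12} + \sigma_{23} - \sigma_{22} - \sigma_{13} & \sigma_{22} - 2\sigma_{23} + \sigma_{33} \end{bmatrix}
\end{aligned}$$

Alternatively, the mean vector $\mathbf{A}\boldsymbol{\mu}$ and covariance matrix $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$ may be
verified by direct calculation of the means and covariances of the two random variables
$Y_1 = X_1 - X_2$ and $Y_2 = X_2 - X_3$. ■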
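The symbolic result of Example 4.4 can be spot-checked with numbers. The sketch below assumes NumPy; the particular mean vector and covariance matrix are hypothetical values chosen only so that the covariance matrix is positive definite.

```python
import numpy as np

# Hypothetical parameters of a trivariate normal vector X.
mu = np.array([2.0, 1.0, -1.0])
Sigma = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

# A picks out the two differences X1 - X2 and X2 - X3.
A = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])

mean_AX = A @ mu            # A mu
cov_AX = A @ Sigma @ A.T    # A Sigma A'

# Entry-by-entry formulas from the example (zero-based indices into Sigma).
s = Sigma
expected_cov = np.array([
    [s[0, 0] - 2 * s[0, 1] + s[1, 1], s[0, 1] + s[1, 2] - s[1, 1] - s[0, 2]],
    [s[0, 1] + s[1, 2] - s[1, 1] - s[0, 2], s[1, 1] - 2 * s[1, 2] + s[2, 2]],
])

print(mean_AX)
print(cov_AX)
print(expected_cov)
```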


We have mentioned that all subsets of a multivariate normal random vector X 
are themselves normally distributed. We state this property formally as Result 4.4. 


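Result 4.4. All subsets of $\mathbf{X}$ are normally distributed. If we respectively partition
$\mathbf{X}$, its mean vector $\boldsymbol{\mu}$, and its covariance matrix $\boldsymbol{\Sigma}$ as

$$\underset{(p \times 1)}{\mathbf{X}} = \begin{bmatrix} \underset{(q \times 1)}{\mathbf{X}_1} \\[2mm] \underset{((p-q) \times 1)}{\mathbf{X}_2} \end{bmatrix}, \qquad \underset{(p \times 1)}{\boldsymbol{\mu}} = \begin{bmatrix} \underset{(q \times 1)}{\boldsymbol{\mu}_1} \\[2mm] \underset{((p-q) \times 1)}{\boldsymbol{\mu}_2} \end{bmatrix}$$

and

$$\underset{(p \times p)}{\boldsymbol{\Sigma}} = \begin{bmatrix} \underset{(q \times q)}{\boldsymbol{\Sigma}_{11}} & \underset{(q \times (p-q))}{\boldsymbol{\Sigma}_{12}} \\[2mm] \underset{((p-q) \times q)}{\boldsymbol{\Sigma}_{21}} & \underset{((p-q) \times (p-q))}{\boldsymbol{\Sigma}_{22}} \end{bmatrix}$$

then $\mathbf{X}_1$ is distributed as $N_q(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$.


Proof. Set $\underset{(q \times p)}{\mathbf{A}} = \left[\,\underset{(q \times q)}{\mathbf{I}} \;\Big|\; \underset{(q \times (p-q))}{\mathbf{0}}\,\right]$ in Result 4.3, and the conclusion follows.

To apply Result 4.4 to an arbitrary subset of the components of $\mathbf{X}$, we simply relabel
the subset of interest as $\mathbf{X}_1$ and select the corresponding component means and
covariances as $\boldsymbol{\mu}_1$ and $\boldsymbol{\Sigma}_{11}$, respectively. ■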




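Example 4.5 (The distribution of a subset of a normal random vector)
If $\mathbf{X}$ is distributed as $N_5(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, find the distribution of $\begin{bmatrix} X_2 \\ X_4 \end{bmatrix}$. We set

$$\mathbf{X}_1 = \begin{bmatrix} X_2 \\ X_4 \end{bmatrix}, \qquad \boldsymbol{\mu}_1 = \begin{bmatrix} \mu_2 \\ \mu_4 \end{bmatrix}, \qquad \boldsymbol{\Sigma}_{11} = \begin{bmatrix} \sigma_{22} & \sigma_{24} \\ \sigma_{24} & \sigma_{44} \end{bmatrix}$$

and note that with this assignment, $\mathbf{X}$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$ can respectively be rearranged and
partitioned as

$$\mathbf{X} = \begin{bmatrix} X_2 \\ X_4 \\ \hline X_1 \\ X_3 \\ X_5 \end{bmatrix}, \qquad \boldsymbol{\mu} = \begin{bmatrix} \mu_2 \\ \mu_4 \\ \hline \mu_1 \\ \mu_3 \\ \mu_5 \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \left[\begin{array}{cc|ccc} \sigma_{22} & \sigma_{24} & \sigma_{12} & \sigma_{23} & \sigma_{25} \\ \sigma_{24} & \sigma_{44} & \sigma_{14} & \sigma_{34} & \sigma_{45} \\ \hline \sigma_{12} & \sigma_{14} & \sigma_{11} & \sigma_{13} & \sigma_{15} \\ \sigma_{23} & \sigma_{34} & \sigma_{13} & \sigma_{33} & \sigma_{35} \\ \sigma_{25} & \sigma_{45} & \sigma_{15} & \sigma_{35} & \sigma_{55} \end{array}\right]$$

or

$$\mathbf{X} = \begin{bmatrix} \underset{(2 \times 1)}{\mathbf{X}_1} \\[2mm] \underset{(3 \times 1)}{\mathbf{X}_2} \end{bmatrix}, \qquad \boldsymbol{\mu} = \begin{bmatrix} \underset{(2 \times 1)}{\boldsymbol{\mu}_1} \\[2mm] \underset{(3 \times 1)}{\boldsymbol{\mu}_2} \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \underset{(2 \times 2)}{\boldsymbol{\Sigma}_{11}} & \underset{(2 \times 3)}{\boldsymbol{\Sigma}_{12}} \\[2mm] \underset{(3 \times 2)}{\boldsymbol{\Sigma}_{21}} & \underset{(3 \times 3)}{\boldsymbol{\Sigma}_{22}} \end{bmatrix}$$

Thus, from Result 4.4, for

$$\mathbf{X}_1 = \begin{bmatrix} X_2 \\ X_4 \end{bmatrix}$$

we have the distribution

$$N_2(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}) = N_2\left(\begin{bmatrix} \mu_2 \\ \mu_4 \end{bmatrix}, \begin{bmatrix} \sigma_{22} & \sigma_{24} \\ \sigma_{24} & \sigma_{44} \end{bmatrix}\right)$$

It is clear from this example that the normal distribution for any subset can be
expressed by simply selecting the appropriate means and covariances from the
original $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. The formal process of relabeling and partitioning is unnecessary. ■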
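Selecting the parameters of a subset amounts to indexing. The sketch below assumes NumPy; the five-dimensional mean vector and covariance matrix are generated arbitrarily for illustration and are not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters for a 5-dimensional normal vector.
B = rng.normal(size=(5, 5))
Sigma = B @ B.T + np.eye(5)      # symmetric and positive definite by construction
mu = np.arange(1.0, 6.0)         # [1, 2, 3, 4, 5]

# Zero-based positions of the components X2 and X4.
keep = [1, 3]

mu_sub = mu[keep]                      # [mu_2, mu_4]
Sigma_sub = Sigma[np.ix_(keep, keep)]  # [[s22, s24], [s24, s44]]

print(mu_sub)
print(Sigma_sub)
```

Here `np.ix_` builds the row and column index grid, so no explicit rearrangement of the full covariance matrix is needed.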


We are now in a position to state that zero correlation between normal random 
variables or sets of normal random variables is equivalent to statistical independence. 


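Result 4.5.

(a) If $\underset{(q_1 \times 1)}{\mathbf{X}_1}$ and $\underset{(q_2 \times 1)}{\mathbf{X}_2}$ are independent, then $\mathrm{Cov}(\mathbf{X}_1, \mathbf{X}_2) = \mathbf{0}$, a $q_1 \times q_2$ matrix of
zeros.

(b) If $\begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix}$ is $N_{q_1+q_2}\left(\begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}\right)$, then $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent if
and only if $\boldsymbol{\Sigma}_{12} = \mathbf{0}$.

(c) If $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent and are distributed as $N_{q_1}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$ and
$N_{q_2}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22})$, respectively, then $\begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix}$ has the multivariate normal distribution

$$N_{q_1+q_2}\left(\begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22} \end{bmatrix}\right)$$


Proof. (See Exercise 4.14 for partial proofs based upon factoring the density
function when $\boldsymbol{\Sigma}_{12} = \mathbf{0}$.) ■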


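Example 4.6 (The equivalence of zero covariance and independence for normal
variables) Let $\underset{(3 \times 1)}{\mathbf{X}}$ be $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with

$$\boldsymbol{\Sigma} = \begin{bmatrix} 4 & 1 & 0 \\ 1 & 3 & 0 \\ 0 & 0 & 2 \end{bmatrix}$$

Are $X_1$ and $X_2$ independent? What about $(X_1, X_2)$ and $X_3$?

Since $X_1$ and $X_2$ have covariance $\sigma_{12} = 1$, they are not independent. However,
partitioning $\mathbf{X}$ and $\boldsymbol{\Sigma}$ as

$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ \hline X_3 \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \left[\begin{array}{cc|c} 4 & 1 & 0 \\ 1 & 3 & 0 \\ \hline 0 & 0 & 2 \end{array}\right] = \begin{bmatrix} \underset{(2 \times 2)}{\boldsymbol{\Sigma}_{11}} & \underset{(2 \times 1)}{\boldsymbol{\Sigma}_{12}} \\[2mm] \underset{(1 \times 2)}{\boldsymbol{\Sigma}_{21}} & \underset{(1 \times 1)}{\boldsymbol{\Sigma}_{22}} \end{bmatrix}$$

we see that $\mathbf{X}_1 = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$ and $X_3$ have covariance matrix $\boldsymbol{\Sigma}_{12} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$. Therefore,
$(X_1, X_2)$ and $X_3$ are independent by Result 4.5. This implies $X_3$ is independent of
$X_1$ and also of $X_2$. ■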


We pointed out in our discussion of the bivariate normal distribution that
$\rho_{12} = 0$ (zero correlation) implied independence because the joint density function
[see (4-6)] could then be written as the product of the marginal (normal) densities of
$X_1$ and $X_2$. This fact, which we encouraged you to verify directly, is simply a special
case of Result 4.5 with $q_1 = q_2 = 1$.


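Result 4.6. Let $\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix}$ be distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}$,
$\boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}$, and $|\boldsymbol{\Sigma}_{22}| > 0$. Then the conditional distribution of $\mathbf{X}_1$, given
that $\mathbf{X}_2 = \mathbf{x}_2$, is normal and has

$$\text{Mean} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$$

and

$$\text{Covariance} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}$$

Note that the covariance does not depend on the value $\mathbf{x}_2$ of the conditioning
variable.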


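Proof. We shall give an indirect proof. (See Exercise 4.13, which uses the densities
directly.) Take

$$\underset{(p \times p)}{\mathbf{A}} = \begin{bmatrix} \underset{(q \times q)}{\mathbf{I}} & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\[2mm] \underset{((p-q) \times q)}{\mathbf{0}} & \underset{((p-q) \times (p-q))}{\mathbf{I}} \end{bmatrix}$$

so

$$\mathbf{A}(\mathbf{X} - \boldsymbol{\mu}) = \mathbf{A}\begin{bmatrix} \mathbf{X}_1 - \boldsymbol{\mu}_1 \\ \mathbf{X}_2 - \boldsymbol{\mu}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2) \\ \mathbf{X}_2 - \boldsymbol{\mu}_2 \end{bmatrix}$$

is jointly normal with covariance matrix $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$ given by

$$\begin{bmatrix} \mathbf{I} & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{0} & \mathbf{I} \end{bmatrix}\begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}\begin{bmatrix} \mathbf{I} & \mathbf{0}' \\ (-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1})' & \mathbf{I} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} & \mathbf{0}' \\ \mathbf{0} & \boldsymbol{\Sigma}_{22} \end{bmatrix}$$

Since $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2)$ and $\mathbf{X}_2 - \boldsymbol{\mu}_2$ have zero covariance, they are
independent. Moreover, the quantity $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2)$ has distribution
$N_q(\mathbf{0}, \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})$. Given that $\mathbf{X}_2 = \mathbf{x}_2$, $\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$ is a constant.
Because $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2)$ and $\mathbf{X}_2 - \boldsymbol{\mu}_2$ are independent, the
conditional distribution of $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$ is the same as the unconditional
distribution of $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2)$. Since $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2)$
is $N_q(\mathbf{0}, \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})$, so is the random vector $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$
when $\mathbf{X}_2$ has the particular value $\mathbf{x}_2$. Equivalently, given that $\mathbf{X}_2 = \mathbf{x}_2$, $\mathbf{X}_1$ is
distributed as $N_q(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\; \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})$. ■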
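The conditional mean and covariance of Result 4.6 can be computed directly from a partitioned mean vector and covariance matrix. The following is a sketch assuming NumPy; the function name `conditional_normal` and the bivariate test values are hypothetical.

```python
import numpy as np

def conditional_normal(mu, Sigma, q, x2):
    """Mean and covariance of X1 (first q components) given X2 = x2, per Result 4.6."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    x2 = np.atleast_1d(np.asarray(x2, dtype=float))
    mu1, mu2 = mu[:q], mu[q:]
    S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
    S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
    # B = Sigma_12 Sigma_22^{-1}; solve(S22, S21) is Sigma_22^{-1} Sigma_21, and
    # its transpose equals B because Sigma is symmetric.
    B = np.linalg.solve(S22, S21).T
    cond_mean = mu1 + B @ (x2 - mu2)
    cond_cov = S11 - B @ S21
    return cond_mean, cond_cov

# Hypothetical bivariate case: sigma_11 = 4, sigma_22 = 1, sigma_12 = 1.2.
mu = [1.0, 2.0]
Sigma = [[4.0, 1.2],
         [1.2, 1.0]]
cond_mean, cond_cov = conditional_normal(mu, Sigma, q=1, x2=3.0)

# Bivariate closed forms: mu_1 + (sigma_12/sigma_22)(x2 - mu_2) and sigma_11(1 - rho^2).
rho_sq = 1.2 ** 2 / (4.0 * 1.0)
print(cond_mean, 1.0 + (1.2 / 1.0) * (3.0 - 2.0))
print(cond_cov, 4.0 * (1.0 - rho_sq))
```

Note that the conditional covariance does not involve `x2` at all, matching the remark following Result 4.6.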


Example 4.7 (The conditional density of a bivariate normal distribution) The
conditional density of $X_1$, given that $X_2 = x_2$ for any bivariate distribution, is
defined by

$$f(x_1 \mid x_2) = \{\text{conditional density of } X_1 \text{ given that } X_2 = x_2\} = \frac{f(x_1, x_2)}{f(x_2)}$$

where $f(x_2)$ is the marginal distribution of $X_2$. If $f(x_1, x_2)$ is the bivariate normal
density, show that $f(x_1 \mid x_2)$ is

$$N\!\left(\mu_1 + \frac{\sigma_{12}}{\sigma_{22}}(x_2 - \mu_2),\ \sigma_{11} - \frac{\sigma_{12}^2}{\sigma_{22}}\right)$$

Here $\sigma_{11} - \sigma_{12}^2/\sigma_{22} = \sigma_{11}(1 - \rho_{12}^2)$. The two terms involving $x_1 - \mu_1$ in the exponent
of the bivariate normal density [see Equation (4-6)] become, apart from the
multiplicative constant $-1/2(1 - \rho_{12}^2)$,


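$$\frac{(x_1 - \mu_1)^2}{\sigma_{11}} - 2\rho_{12}\frac{(x_1 - \mu_1)(x_2 - \mu_2)}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}}
= \frac{1}{\sigma_{11}}\left[x_1 - \mu_1 - \rho_{12}\frac{\sqrt{\sigma_{11}}}{\sqrt{\sigma_{22}}}(x_2 - \mu_2)\right]^2 - \frac{\rho_{12}^2}{\sigma_{22}}(x_2 - \mu_2)^2$$

Because $\rho_{12} = \sigma_{12}/\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}$, or $\rho_{12}\sqrt{\sigma_{11}}/\sqrt{\sigma_{22}} = \sigma_{12}/\sigma_{22}$, the complete exponent is

$$\begin{aligned}
&\frac{-1}{2(1 - \rho_{12}^2)}\left(\frac{(x_1 - \mu_1)^2}{\sigma_{11}} - 2\rho_{12}\frac{(x_1 - \mu_1)(x_2 - \mu_2)}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}} + \frac{(x_2 - \mu_2)^2}{\sigma_{22}}\right) \\
&\quad = \frac{-1}{2\sigma_{11}(1 - \rho_{12}^2)}\left(x_1 - \mu_1 - \rho_{12}\frac{\sqrt{\sigma_{11}}}{\sqrt{\sigma_{22}}}(x_2 - \mu_2)\right)^2 - \frac{1}{2(1 - \rho_{12}^2)}\left(\frac{1}{\sigma_{22}} - \frac{\rho_{12}^2}{\sigma_{22}}\right)(x_2 - \mu_2)^2 \\
&\quad = \frac{-1}{2\sigma_{11}(1 - \rho_{12}^2)}\left(x_1 - \mu_1 - \frac{\sigma_{12}}{\sigma_{22}}(x_2 - \mu_2)\right)^2 - \frac{1}{2}\frac{(x_2 - \mu_2)^2}{\sigma_{22}}
\end{aligned}$$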


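The constant term $2\pi\sqrt{\sigma_{11}\sigma_{22}(1 - \rho_{12}^2)}$ also factors as

$$\sqrt{2\pi}\sqrt{\sigma_{22}} \times \sqrt{2\pi}\sqrt{\sigma_{11}(1 - \rho_{12}^2)}$$

Dividing the joint density of $X_1$ and $X_2$ by the marginal density

$$f(x_2) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}}\, e^{-(x_2 - \mu_2)^2/2\sigma_{22}}$$

and canceling terms yields the conditional density

$$f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)} = \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{11}(1 - \rho_{12}^2)}}\, e^{-[x_1 - \mu_1 - (\sigma_{12}/\sigma_{22})(x_2 - \mu_2)]^2/2\sigma_{11}(1 - \rho_{12}^2)}, \qquad -\infty < x_1 < \infty$$

Thus, with our customary notation, the conditional distribution of $X_1$ given that
$X_2 = x_2$ is $N(\mu_1 + (\sigma_{12}/\sigma_{22})(x_2 - \mu_2),\ \sigma_{11}(1 - \rho_{12}^2))$. Now, $\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} =
\sigma_{11} - \sigma_{12}^2/\sigma_{22} = \sigma_{11}(1 - \rho_{12}^2)$ and $\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} = \sigma_{12}/\sigma_{22}$, agreeing with Result 4.6,
which we obtained by an indirect method. ■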
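
As a numerical companion to Result 4.6 and Example 4.7, the following minimal Python sketch evaluates the conditional mean and variance of $X_1$ given $X_2 = x_2$ in the bivariate case. The function name `conditional_normal_bivariate` and the parameter values are illustrative assumptions of ours, not quantities taken from the text.

```python
import math

def conditional_normal_bivariate(mu1, mu2, s11, s12, s22, x2):
    """Mean and variance of X1 given X2 = x2 for a bivariate normal distribution."""
    if s22 <= 0:
        raise ValueError("sigma_22 must be positive")
    slope = s12 / s22                       # Sigma_12 * Sigma_22^{-1}
    cond_mean = mu1 + slope * (x2 - mu2)    # mu_1 + (sigma_12/sigma_22)(x2 - mu_2)
    cond_var = s11 - s12 * s12 / s22        # sigma_11 - sigma_12^2 / sigma_22
    return cond_mean, cond_var

# Made-up parameter values, used only for illustration.
cond_mean, cond_var = conditional_normal_bivariate(
    mu1=0.0, mu2=2.0, s11=4.0, s12=1.5, s22=1.0, x2=3.0)
rho12 = 1.5 / math.sqrt(4.0 * 1.0)
alt_var = 4.0 * (1.0 - rho12 ** 2)          # sigma_11 (1 - rho_12^2)
print(cond_mean, cond_var, alt_var)
```

For these values the two expressions for the conditional variance, $\sigma_{11} - \sigma_{12}^2/\sigma_{22}$ and $\sigma_{11}(1 - \rho_{12}^2)$, coincide, and the conditional variance does not change if a different $x_2$ is supplied.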




For the multivariate normal situation, it is worth emphasizing the following: 


1. All conditional distributions are (multivariate) normal. 
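2. The conditional mean is of the form

$$\begin{aligned}
&\mu_1 + \beta_{1,q+1}(x_{q+1} - \mu_{q+1}) + \cdots + \beta_{1,p}(x_p - \mu_p) \\
&\qquad\vdots \\
&\mu_q + \beta_{q,q+1}(x_{q+1} - \mu_{q+1}) + \cdots + \beta_{q,p}(x_p - \mu_p)
\end{aligned} \tag{4-9}$$

where the $\beta$'s are defined by

$$\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} = \begin{bmatrix}
\beta_{1,q+1} & \beta_{1,q+2} & \cdots & \beta_{1,p} \\
\beta_{2,q+1} & \beta_{2,q+2} & \cdots & \beta_{2,p} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{q,q+1} & \beta_{q,q+2} & \cdots & \beta_{q,p}
\end{bmatrix}$$

3. The conditional covariance, $\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}$, does not depend upon the value(s)
of the conditioning variable(s).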


We conclude this section by presenting two final properties of multivariate 
normal random vectors. One has to do with the probability content of the ellipsoids 
of constant density. The other discusses the distribution of another form of linear 
combinations. 

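The chi-square distribution determines the variability of the sample variance
$s^2 = s_{11}$ for samples from a univariate normal population. It also plays a basic role
in the multivariate case.


Result 4.7. Let $\mathbf{X}$ be distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $|\boldsymbol{\Sigma}| > 0$. Then

(a) $(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu})$ is distributed as $\chi_p^2$, where $\chi_p^2$ denotes the chi-square
distribution with $p$ degrees of freedom.

(b) The $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution assigns probability $1 - \alpha$ to the solid ellipsoid
$\{\mathbf{x} : (\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) \le \chi_p^2(\alpha)\}$, where $\chi_p^2(\alpha)$ denotes the upper $(100\alpha)$th
percentile of the $\chi_p^2$ distribution.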


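Proof. We know that $\chi_p^2$ is defined as the distribution of the sum $Z_1^2 + Z_2^2 + \cdots + Z_p^2$,
where $Z_1, Z_2, \ldots, Z_p$ are independent $N(0, 1)$ random variables. Next, by the
spectral decomposition [see Equations (2-16) and (2-21) with $\mathbf{A} = \boldsymbol{\Sigma}$, and see
Result 4.1], $\boldsymbol{\Sigma}^{-1} = \sum_{i=1}^{p} \frac{1}{\lambda_i}\mathbf{e}_i\mathbf{e}_i'$, where $\boldsymbol{\Sigma}\mathbf{e}_i = \lambda_i\mathbf{e}_i$, so $\boldsymbol{\Sigma}^{-1}\mathbf{e}_i = (1/\lambda_i)\mathbf{e}_i$. Consequently,

$$(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu}) = \sum_{i=1}^{p}(1/\lambda_i)(\mathbf{X} - \boldsymbol{\mu})'\mathbf{e}_i\mathbf{e}_i'(\mathbf{X} - \boldsymbol{\mu}) = \sum_{i=1}^{p}(1/\lambda_i)\bigl(\mathbf{e}_i'(\mathbf{X} - \boldsymbol{\mu})\bigr)^2 = \sum_{i=1}^{p}\bigl[(1/\sqrt{\lambda_i})\,\mathbf{e}_i'(\mathbf{X} - \boldsymbol{\mu})\bigr]^2 = \sum_{i=1}^{p} Z_i^2$$

for instance. Now, we can write $\mathbf{Z} = \mathbf{A}(\mathbf{X} - \boldsymbol{\mu})$, where

$$\underset{(p \times 1)}{\mathbf{Z}} = \begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_p \end{bmatrix}, \qquad \underset{(p \times p)}{\mathbf{A}} = \begin{bmatrix} \frac{1}{\sqrt{\lambda_1}}\mathbf{e}_1' \\ \frac{1}{\sqrt{\lambda_2}}\mathbf{e}_2' \\ \vdots \\ \frac{1}{\sqrt{\lambda_p}}\mathbf{e}_p' \end{bmatrix}$$

and $\mathbf{X} - \boldsymbol{\mu}$ is distributed as $N_p(\mathbf{0}, \boldsymbol{\Sigma})$. Therefore, by Result 4.3, $\mathbf{Z} = \mathbf{A}(\mathbf{X} - \boldsymbol{\mu})$ is
distributed as $N_p(\mathbf{0}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$, where

$$\underset{(p \times p)}{\mathbf{A}}\ \underset{(p \times p)}{\boldsymbol{\Sigma}}\ \underset{(p \times p)}{\mathbf{A}'} = \begin{bmatrix} \frac{1}{\sqrt{\lambda_1}}\mathbf{e}_1' \\ \vdots \\ \frac{1}{\sqrt{\lambda_p}}\mathbf{e}_p' \end{bmatrix} \left[\sum_{i=1}^{p}\lambda_i\mathbf{e}_i\mathbf{e}_i'\right] \left[\frac{1}{\sqrt{\lambda_1}}\mathbf{e}_1 \ \cdots\ \frac{1}{\sqrt{\lambda_p}}\mathbf{e}_p\right] = \begin{bmatrix} \sqrt{\lambda_1}\,\mathbf{e}_1' \\ \vdots \\ \sqrt{\lambda_p}\,\mathbf{e}_p' \end{bmatrix} \left[\frac{1}{\sqrt{\lambda_1}}\mathbf{e}_1 \ \cdots\ \frac{1}{\sqrt{\lambda_p}}\mathbf{e}_p\right] = \mathbf{I}$$

By Result 4.5, $Z_1, Z_2, \ldots, Z_p$ are independent standard normal variables, and we
conclude that $(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu})$ has a $\chi_p^2$-distribution.

For Part b, we note that $P[(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu}) \le c^2]$ is the probability assigned
to the ellipsoid $(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu}) \le c^2$ by the density $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. But
from Part a, $P[(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu}) \le \chi_p^2(\alpha)] = 1 - \alpha$, and Part b holds. ■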


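Remark: (Interpretation of statistical distance) Result 4.7 provides an interpretation
of a squared statistical distance. When $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$,

$$(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu})$$

is the squared statistical distance from $\mathbf{X}$ to the population mean vector $\boldsymbol{\mu}$. If one
component has a much larger variance than another, it will contribute less to the
squared distance. Moreover, two highly correlated random variables will contribute
less than two variables that are nearly uncorrelated. Essentially, the use of the inverse
of the covariance matrix (1) standardizes all of the variables and (2) eliminates
the effects of correlation. From the proof of Result 4.7,

$$(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu}) = Z_1^2 + Z_2^2 + \cdots + Z_p^2$$

In terms of $\boldsymbol{\Sigma}^{-1/2}$ (see (2-22)), $\mathbf{Z} = \boldsymbol{\Sigma}^{-1/2}(\mathbf{X} - \boldsymbol{\mu})$ has a $N_p(\mathbf{0}, \mathbf{I}_p)$ distribution, and

$$(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu}) = (\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1/2}\boldsymbol{\Sigma}^{-1/2}(\mathbf{X} - \boldsymbol{\mu}) = \mathbf{Z}'\mathbf{Z} = Z_1^2 + Z_2^2 + \cdots + Z_p^2$$
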

The squared statistical distance is calculated as if, first, the random vector X were 
transformed to p independent standard normal random variables and then the 
usual squared distance, the sum of the squares of the variables, were applied. 
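
To make the remark concrete, here is a small Python sketch for the case $p = 2$, where the inverse of $\boldsymbol{\Sigma}$ can be written out by hand and the chi-square percentile has a closed form, since $P[\chi_2^2 > c] = e^{-c/2}$. The function names and the covariance matrix are hypothetical choices made for illustration only.

```python
import math

def squared_distance_2d(x, mu, sigma):
    """(x - mu)' Sigma^{-1} (x - mu) for a 2 x 2 positive definite Sigma."""
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    if det <= 0:
        raise ValueError("Sigma must be positive definite")
    d1, d2 = x[0] - mu[0], x[1] - mu[1]
    # Uses the explicit inverse of a 2 x 2 matrix: (1/det) [[s22, -s12], [-s21, s11]].
    return (s22 * d1 * d1 - (s12 + s21) * d1 * d2 + s11 * d2 * d2) / det

def chi2_upper_percentile_2df(alpha):
    """Upper (100*alpha)th percentile of chi-square with 2 d.f. (closed form for p = 2 only)."""
    return -2.0 * math.log(alpha)

# Hypothetical population: uncorrelated components with variances 4 and 1.
mu = (0.0, 0.0)
sigma = ((4.0, 0.0), (0.0, 1.0))
d_high_var = squared_distance_2d((2.0, 0.0), mu, sigma)  # 2 units along the high-variance axis
d_low_var = squared_distance_2d((0.0, 2.0), mu, sigma)   # 2 units along the low-variance axis
cutoff = chi2_upper_percentile_2df(0.05)                 # boundary of the 95% ellipse
print(d_high_var, d_low_var, cutoff)
```

The same Euclidean displacement yields a smaller squared statistical distance along the component with the larger variance, and a point lies inside the 95% ellipse of Result 4.7(b) exactly when its squared distance is at most `cutoff`.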

Next, consider the linear combination of vector random variables 


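$$c_1\mathbf{X}_1 + c_2\mathbf{X}_2 + \cdots + c_n\mathbf{X}_n = \underset{(p \times n)}{[\mathbf{X}_1 \mid \mathbf{X}_2 \mid \cdots \mid \mathbf{X}_n]}\ \underset{(n \times 1)}{\mathbf{c}} \tag{4-10}$$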


This linear combination differs from the linear combinations considered earlier in 
that it defines a p x 1 vector random variable that is a linear combination of vec- 
tors. Previously, we discussed a single random variable that could be written as a lin- 
ear combination of other univariate random variables. 


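Result 4.8. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be mutually independent with $\mathbf{X}_j$ distributed as
$N_p(\boldsymbol{\mu}_j, \boldsymbol{\Sigma})$. (Note that each $\mathbf{X}_j$ has the same covariance matrix $\boldsymbol{\Sigma}$.) Then

$$\mathbf{V}_1 = c_1\mathbf{X}_1 + c_2\mathbf{X}_2 + \cdots + c_n\mathbf{X}_n$$

is distributed as $N_p\!\left(\sum_{j=1}^{n} c_j\boldsymbol{\mu}_j,\ \Bigl(\sum_{j=1}^{n} c_j^2\Bigr)\boldsymbol{\Sigma}\right)$. Moreover, $\mathbf{V}_1$ and $\mathbf{V}_2 = b_1\mathbf{X}_1 + b_2\mathbf{X}_2
+ \cdots + b_n\mathbf{X}_n$ are jointly multivariate normal with covariance matrix

$$\begin{bmatrix} \Bigl(\sum_{j=1}^{n} c_j^2\Bigr)\boldsymbol{\Sigma} & (\mathbf{b}'\mathbf{c})\boldsymbol{\Sigma} \\ (\mathbf{b}'\mathbf{c})\boldsymbol{\Sigma} & \Bigl(\sum_{j=1}^{n} b_j^2\Bigr)\boldsymbol{\Sigma} \end{bmatrix}$$

Consequently, $\mathbf{V}_1$ and $\mathbf{V}_2$ are independent if $\mathbf{b}'\mathbf{c} = \sum_{j=1}^{n} c_j b_j = 0$.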

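Proof. By Result 4.5(c), the $np$ component vector

$$[X_{11}, \ldots, X_{1p}, X_{21}, \ldots, X_{2p}, \ldots, X_{n1}, \ldots, X_{np}] = [\mathbf{X}_1', \mathbf{X}_2', \ldots, \mathbf{X}_n'] = \underset{(1 \times np)}{\mathbf{X}'}$$

is multivariate normal. In particular, $\underset{(np \times 1)}{\mathbf{X}}$ is distributed as $N_{np}(\boldsymbol{\mu}, \boldsymbol{\Sigma}_{\mathbf{x}})$, where

$$\underset{(np \times 1)}{\boldsymbol{\mu}} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \\ \vdots \\ \boldsymbol{\mu}_n \end{bmatrix} \quad\text{and}\quad \underset{(np \times np)}{\boldsymbol{\Sigma}_{\mathbf{x}}} = \begin{bmatrix} \boldsymbol{\Sigma} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Sigma} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \boldsymbol{\Sigma} \end{bmatrix}$$

The choice

$$\underset{(2p \times np)}{\mathbf{A}} = \begin{bmatrix} c_1\mathbf{I} & c_2\mathbf{I} & \cdots & c_n\mathbf{I} \\ b_1\mathbf{I} & b_2\mathbf{I} & \cdots & b_n\mathbf{I} \end{bmatrix}$$

where $\mathbf{I}$ is the $p \times p$ identity matrix, gives

$$\mathbf{A}\mathbf{X} = \begin{bmatrix} \sum_{j=1}^{n} c_j\mathbf{X}_j \\ \sum_{j=1}^{n} b_j\mathbf{X}_j \end{bmatrix} = \begin{bmatrix} \mathbf{V}_1 \\ \mathbf{V}_2 \end{bmatrix}$$

and $\mathbf{A}\mathbf{X}$ is normal $N_{2p}(\mathbf{A}\boldsymbol{\mu}, \mathbf{A}\boldsymbol{\Sigma}_{\mathbf{x}}\mathbf{A}')$ by Result 4.3. Straightforward block multiplication
shows that $\mathbf{A}\boldsymbol{\Sigma}_{\mathbf{x}}\mathbf{A}'$ has the first block diagonal term

$$[c_1\boldsymbol{\Sigma}, c_2\boldsymbol{\Sigma}, \ldots, c_n\boldsymbol{\Sigma}]\,[c_1\mathbf{I}, c_2\mathbf{I}, \ldots, c_n\mathbf{I}]' = \Bigl(\sum_{j=1}^{n} c_j^2\Bigr)\boldsymbol{\Sigma}$$

The off-diagonal term is

$$[c_1\boldsymbol{\Sigma}, c_2\boldsymbol{\Sigma}, \ldots, c_n\boldsymbol{\Sigma}]\,[b_1\mathbf{I}, b_2\mathbf{I}, \ldots, b_n\mathbf{I}]' = \Bigl(\sum_{j=1}^{n} c_j b_j\Bigr)\boldsymbol{\Sigma}$$

This term is the covariance matrix for $\mathbf{V}_1$, $\mathbf{V}_2$. Consequently, when $\sum_{j=1}^{n} c_j b_j =
\mathbf{b}'\mathbf{c} = 0$, so that $\bigl(\sum_{j=1}^{n} c_j b_j\bigr)\boldsymbol{\Sigma} = \underset{(p \times p)}{\mathbf{0}}$, $\mathbf{V}_1$ and $\mathbf{V}_2$ are independent by Result 4.5(b). ■

For sums of the type in (4-10), the property of zero correlation is equivalent to
requiring the coefficient vectors $\mathbf{b}$ and $\mathbf{c}$ to be perpendicular.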


Example 4.8 (Linear combinations of random vectors) Let $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, and $\mathbf{X}_4$ be
independent and identically distributed $3 \times 1$ random vectors with

$$\boldsymbol{\mu} = \begin{bmatrix} 3 \\ -1 \\ 1 \end{bmatrix} \quad\text{and}\quad \boldsymbol{\Sigma} = \begin{bmatrix} 3 & -1 & 1 \\ -1 & 1 & 0 \\ 1 & 0 & 2 \end{bmatrix}$$

We first consider a linear combination $\mathbf{a}'\mathbf{X}_1$ of the three components of $\mathbf{X}_1$. This is a
random variable with mean

$$\mathbf{a}'\boldsymbol{\mu} = 3a_1 - a_2 + a_3$$

and variance

$$\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a} = 3a_1^2 + a_2^2 + 2a_3^2 - 2a_1a_2 + 2a_1a_3$$

That is, a linear combination $\mathbf{a}'\mathbf{X}_1$ of the components of a random vector is a single
random variable consisting of a sum of terms that are each a constant times a variable.
This is very different from a linear combination of random vectors, say,

$$c_1\mathbf{X}_1 + c_2\mathbf{X}_2 + c_3\mathbf{X}_3 + c_4\mathbf{X}_4$$

which is itself a random vector. Here each term in the sum is a constant times a
random vector.
Now consider two linear combinations of random vectors

$$\tfrac{1}{2}\mathbf{X}_1 + \tfrac{1}{2}\mathbf{X}_2 + \tfrac{1}{2}\mathbf{X}_3 + \tfrac{1}{2}\mathbf{X}_4$$

and

$$\mathbf{X}_1 + \mathbf{X}_2 + \mathbf{X}_3 - 3\mathbf{X}_4$$

Find the mean vector and covariance matrix for each linear combination of vectors
and also the covariance between them.


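By Result 4.8 with $c_1 = c_2 = c_3 = c_4 = 1/2$, the first linear combination has
mean vector

$$(c_1 + c_2 + c_3 + c_4)\boldsymbol{\mu} = 2\boldsymbol{\mu} = \begin{bmatrix} 6 \\ -2 \\ 2 \end{bmatrix}$$

and covariance matrix

$$(c_1^2 + c_2^2 + c_3^2 + c_4^2)\boldsymbol{\Sigma} = 1 \times \boldsymbol{\Sigma} = \begin{bmatrix} 3 & -1 & 1 \\ -1 & 1 & 0 \\ 1 & 0 & 2 \end{bmatrix}$$

For the second linear combination of random vectors, we apply Result 4.8 with
$b_1 = b_2 = b_3 = 1$ and $b_4 = -3$ to get mean vector

$$(b_1 + b_2 + b_3 + b_4)\boldsymbol{\mu} = 0\boldsymbol{\mu} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$

and covariance matrix

$$(b_1^2 + b_2^2 + b_3^2 + b_4^2)\boldsymbol{\Sigma} = 12 \times \boldsymbol{\Sigma} = \begin{bmatrix} 36 & -12 & 12 \\ -12 & 12 & 0 \\ 12 & 0 & 24 \end{bmatrix}$$

Finally, the covariance matrix for the two linear combinations of random vectors is

$$(c_1b_1 + c_2b_2 + c_3b_3 + c_4b_4)\boldsymbol{\Sigma} = 0\boldsymbol{\Sigma} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$

Every component of the first linear combination of random vectors has zero
covariance with every component of the second linear combination of random vectors.
If, in addition, each $\mathbf{X}$ has a trivariate normal distribution, then the two linear
combinations have a joint six-variate normal distribution, and the two linear
combinations of vectors are independent. ■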
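
The arithmetic in Example 4.8 can be checked with a few lines of Python. This is only a sketch: the helper `scale` is a hypothetical name of ours, and the code simply applies the formulas of Result 4.8 to the $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, $\mathbf{c}$, and $\mathbf{b}$ given in the example.

```python
def scale(matrix, k):
    """Multiply every entry of a matrix (list of rows) by the scalar k."""
    return [[k * entry for entry in row] for row in matrix]

mu = [3, -1, 1]
sigma = [[3, -1, 1], [-1, 1, 0], [1, 0, 2]]
c = [0.5, 0.5, 0.5, 0.5]   # coefficients of the first linear combination
b = [1, 1, 1, -3]          # coefficients of the second linear combination

mean_v1 = [sum(c) * m for m in mu]                              # (sum c_j) mu
cov_v1 = scale(sigma, sum(cj ** 2 for cj in c))                 # (sum c_j^2) Sigma
mean_v2 = [sum(b) * m for m in mu]                              # (sum b_j) mu
cov_v2 = scale(sigma, sum(bj ** 2 for bj in b))                 # (sum b_j^2) Sigma
cross_cov = scale(sigma, sum(cj * bj for cj, bj in zip(c, b)))  # (b'c) Sigma

print(mean_v1, cov_v1)
print(mean_v2, cov_v2)
print(cross_cov)
```

Because $\mathbf{b}'\mathbf{c} = 0$ here, every entry of `cross_cov` is zero, which is the perpendicularity condition noted after the proof of Result 4.8.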




4.3 Sampling from a Multivariate Normal Distribution 
and Maximum Likelihood Estimation 


We discussed sampling and selecting random samples briefly in Chapter 3. In this
section, we shall be concerned with samples from a multivariate normal population,
in particular, with the sampling distribution of $\bar{\mathbf{X}}$ and $\mathbf{S}$.


The Multivariate Normal Likelihood 


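Let us assume that the $p \times 1$ vectors $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ represent a random sample
from a multivariate normal population with mean vector $\boldsymbol{\mu}$ and covariance matrix
$\boldsymbol{\Sigma}$. Since $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ are mutually independent and each has distribution
$N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the joint density function of all the observations is the product of the
marginal normal densities:

$$\begin{aligned}
\left\{\begin{matrix}\text{Joint density} \\ \text{of } \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n\end{matrix}\right\}
&= \prod_{j=1}^{n}\left\{\frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}}\, e^{-(\mathbf{x}_j - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu})/2}\right\} \\
&= \frac{1}{(2\pi)^{np/2}}\frac{1}{|\boldsymbol{\Sigma}|^{n/2}}\, e^{-\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu})/2}
\end{aligned} \tag{4-11}$$

When the numerical values of the observations become available, they may be substituted
for the $\mathbf{x}_j$ in Equation (4-11). The resulting expression, now considered as a function
of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ for the fixed set of observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, is called the likelihood.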

Many good statistical procedures employ values for the population parameters
that “best” explain the observed data. One meaning of best is to select the parameter
values that maximize the joint density evaluated at the observations. This technique
is called maximum likelihood estimation, and the maximizing parameter
values are called maximum likelihood estimates.

At this point, we shall consider maximum likelihood estimation of the parameters
$\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ for a multivariate normal population. To do so, we take the observations
$\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ as fixed and consider the joint density of Equation (4-11)
evaluated at these values. The result is the likelihood function. In order to simplify
matters, we rewrite the likelihood function in another form. We shall need some additional
properties for the trace of a square matrix. (The trace of a matrix is the sum
of its diagonal elements, and the properties of the trace are discussed in Definition
2A.28 and Result 2A.12.)

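Result 4.9. Let $\mathbf{A}$ be a $k \times k$ symmetric matrix and $\mathbf{x}$ be a $k \times 1$ vector. Then

(a) $\mathbf{x}'\mathbf{A}\mathbf{x} = \operatorname{tr}(\mathbf{x}'\mathbf{A}\mathbf{x}) = \operatorname{tr}(\mathbf{A}\mathbf{x}\mathbf{x}')$

(b) $\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^{k}\lambda_i$, where the $\lambda_i$ are the eigenvalues of $\mathbf{A}$.


Proof. For Part a, we note that $\mathbf{x}'\mathbf{A}\mathbf{x}$ is a scalar, so $\mathbf{x}'\mathbf{A}\mathbf{x} = \operatorname{tr}(\mathbf{x}'\mathbf{A}\mathbf{x})$. We pointed
out in Result 2A.12 that $\operatorname{tr}(\mathbf{B}\mathbf{C}) = \operatorname{tr}(\mathbf{C}\mathbf{B})$ for any two matrices $\mathbf{B}$ and $\mathbf{C}$ of
dimensions $m \times k$ and $k \times m$, respectively. This follows because $\mathbf{B}\mathbf{C}$ has $\sum_{j=1}^{k} b_{ij}c_{ji}$ as
its $i$th diagonal element, so $\operatorname{tr}(\mathbf{B}\mathbf{C}) = \sum_{i=1}^{m}\Bigl(\sum_{j=1}^{k} b_{ij}c_{ji}\Bigr)$. Similarly, the $j$th diagonal
element of $\mathbf{C}\mathbf{B}$ is $\sum_{i=1}^{m} c_{ji}b_{ij}$, so $\operatorname{tr}(\mathbf{C}\mathbf{B}) = \sum_{j=1}^{k}\Bigl(\sum_{i=1}^{m} c_{ji}b_{ij}\Bigr) = \sum_{i=1}^{m}\Bigl(\sum_{j=1}^{k} b_{ij}c_{ji}\Bigr) = \operatorname{tr}(\mathbf{B}\mathbf{C})$.
Let $\mathbf{x}'$ be the matrix $\mathbf{B}$ with $m = 1$, and let $\mathbf{A}\mathbf{x}$ play the role of the matrix $\mathbf{C}$. Then
$\operatorname{tr}(\mathbf{x}'(\mathbf{A}\mathbf{x})) = \operatorname{tr}((\mathbf{A}\mathbf{x})\mathbf{x}')$, and the result follows.

Part b is proved by using the spectral decomposition of (2-20) to write
$\mathbf{A} = \mathbf{P}'\boldsymbol{\Lambda}\mathbf{P}$, where $\mathbf{P}\mathbf{P}' = \mathbf{I}$ and $\boldsymbol{\Lambda}$ is a diagonal matrix with entries $\lambda_1, \lambda_2, \ldots, \lambda_k$.
Therefore, $\operatorname{tr}(\mathbf{A}) = \operatorname{tr}(\mathbf{P}'\boldsymbol{\Lambda}\mathbf{P}) = \operatorname{tr}(\boldsymbol{\Lambda}\mathbf{P}\mathbf{P}') = \operatorname{tr}(\boldsymbol{\Lambda}) = \lambda_1 + \lambda_2 + \cdots + \lambda_k$. ■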
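
Part (a) of Result 4.9 is easy to confirm numerically. The sketch below, with hypothetical helper names and a made-up symmetric matrix, computes the quadratic form $\mathbf{x}'\mathbf{A}\mathbf{x}$ directly and then as the trace of the $k \times k$ matrix $\mathbf{A}\mathbf{x}\mathbf{x}'$.

```python
def quad_form(a, x):
    """x'Ax computed as the double sum of x_i * a_ij * x_j."""
    k = len(x)
    return sum(x[i] * a[i][j] * x[j] for i in range(k) for j in range(k))

def trace_a_xxt(a, x):
    """tr(A x x'): sum of the diagonal entries of the k x k product A (x x')."""
    k = len(x)
    outer = [[x[i] * x[j] for j in range(k)] for i in range(k)]   # x x'
    return sum(sum(a[i][m] * outer[m][i] for m in range(k)) for i in range(k))

# Made-up symmetric matrix and vector (integers, so the comparison is exact).
a = [[2, 1, 0], [1, 3, -1], [0, -1, 4]]
x = [1, -2, 3]
print(quad_form(a, x), trace_a_xxt(a, x))
```

Both computations return the same number, as the identity $\operatorname{tr}(\mathbf{B}\mathbf{C}) = \operatorname{tr}(\mathbf{C}\mathbf{B})$ with $\mathbf{B} = \mathbf{x}'$ and $\mathbf{C} = \mathbf{A}\mathbf{x}$ requires.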


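Now the exponent in the joint density in (4-11) can be simplified. By Result 4.9(a),

$$(\mathbf{x}_j - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu}) = \operatorname{tr}[(\mathbf{x}_j - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu})] = \operatorname{tr}[\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu})(\mathbf{x}_j - \boldsymbol{\mu})'] \tag{4-12}$$

Next,

$$\begin{aligned}
\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu}) &= \sum_{j=1}^{n}\operatorname{tr}[(\mathbf{x}_j - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu})] \\
&= \sum_{j=1}^{n}\operatorname{tr}[\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu})(\mathbf{x}_j - \boldsymbol{\mu})'] \\
&= \operatorname{tr}\Bigl[\boldsymbol{\Sigma}^{-1}\Bigl(\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu})(\mathbf{x}_j - \boldsymbol{\mu})'\Bigr)\Bigr]
\end{aligned} \tag{4-13}$$

since the trace of a sum of matrices is equal to the sum of the traces of the matrices,
according to Result 2A.12(b). We can add and subtract $\bar{\mathbf{x}} = (1/n)\sum_{j=1}^{n}\mathbf{x}_j$ in each
term $(\mathbf{x}_j - \boldsymbol{\mu})$ in $\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu})(\mathbf{x}_j - \boldsymbol{\mu})'$ to give

$$\begin{aligned}
\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}} + \bar{\mathbf{x}} - \boldsymbol{\mu})(\mathbf{x}_j - \bar{\mathbf{x}} + \bar{\mathbf{x}} - \boldsymbol{\mu})'
&= \sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' + \sum_{j=1}^{n}(\bar{\mathbf{x}} - \boldsymbol{\mu})(\bar{\mathbf{x}} - \boldsymbol{\mu})' \\
&= \sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' + n(\bar{\mathbf{x}} - \boldsymbol{\mu})(\bar{\mathbf{x}} - \boldsymbol{\mu})'
\end{aligned} \tag{4-14}$$

because the cross-product terms, $\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}} - \boldsymbol{\mu})'$ and $\sum_{j=1}^{n}(\bar{\mathbf{x}} - \boldsymbol{\mu})(\mathbf{x}_j - \bar{\mathbf{x}})'$,
are both matrices of zeros. (See Exercise 4.15.) Consequently, using Equations (4-13)
and (4-14), we can write the joint density of a random sample from a multivariate
normal population as

$$\left\{\begin{matrix}\text{Joint density of} \\ \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n\end{matrix}\right\} = (2\pi)^{-np/2}|\boldsymbol{\Sigma}|^{-n/2}
\exp\Bigl\{-\operatorname{tr}\Bigl[\boldsymbol{\Sigma}^{-1}\Bigl(\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' + n(\bar{\mathbf{x}} - \boldsymbol{\mu})(\bar{\mathbf{x}} - \boldsymbol{\mu})'\Bigr)\Bigr]\Big/2\Bigr\} \tag{4-15}$$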




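Substituting the observed values $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ into the joint density yields the likelihood
function. We shall denote this function by $L(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, to stress the fact that it is a
function of the (unknown) population parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. Thus, when the vectors
$\mathbf{x}_j$ contain the specific numbers actually observed, we have

$$L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\boldsymbol{\Sigma}|^{n/2}}\, e^{-\operatorname{tr}\left[\boldsymbol{\Sigma}^{-1}\left(\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' + n(\bar{\mathbf{x}} - \boldsymbol{\mu})(\bar{\mathbf{x}} - \boldsymbol{\mu})'\right)\right]/2} \tag{4-16}$$

It will be convenient in later sections of this book to express the exponent in the likelihood
function (4-16) in different ways. In particular, we shall make use of the identity

$$\begin{aligned}
&\operatorname{tr}\Bigl[\boldsymbol{\Sigma}^{-1}\Bigl(\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' + n(\bar{\mathbf{x}} - \boldsymbol{\mu})(\bar{\mathbf{x}} - \boldsymbol{\mu})'\Bigr)\Bigr] \\
&\quad = \operatorname{tr}\Bigl[\boldsymbol{\Sigma}^{-1}\Bigl(\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\Bigr)\Bigr] + n\operatorname{tr}[\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu})(\bar{\mathbf{x}} - \boldsymbol{\mu})'] \\
&\quad = \operatorname{tr}\Bigl[\boldsymbol{\Sigma}^{-1}\Bigl(\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\Bigr)\Bigr] + n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu})
\end{aligned} \tag{4-17}$$
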

Maximum Likelihood Estimation of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$


The next result will eventually allow us to obtain the maximum likelihood estimators
of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.


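Result 4.10. Given a $p \times p$ symmetric positive definite matrix $\mathbf{B}$ and a scalar
$b > 0$, it follows that

$$\frac{1}{|\boldsymbol{\Sigma}|^{b}}\, e^{-\operatorname{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{B})/2} \le \frac{1}{|\mathbf{B}|^{b}}(2b)^{pb}e^{-bp}$$

for all positive definite $\underset{(p \times p)}{\boldsymbol{\Sigma}}$, with equality holding only for $\boldsymbol{\Sigma} = (1/2b)\mathbf{B}$.


Proof. Let $\mathbf{B}^{1/2}$ be the symmetric square root of $\mathbf{B}$ [see Equation (2-22)],
so $\mathbf{B}^{1/2}\mathbf{B}^{1/2} = \mathbf{B}$, $\mathbf{B}^{1/2}\mathbf{B}^{-1/2} = \mathbf{I}$, and $\mathbf{B}^{-1/2}\mathbf{B}^{-1/2} = \mathbf{B}^{-1}$. Then $\operatorname{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{B}) =
\operatorname{tr}[(\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2})\mathbf{B}^{1/2}] = \operatorname{tr}[\mathbf{B}^{1/2}(\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2})]$. Let $\eta$ be an eigenvalue of $\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}$. This
matrix is positive definite because $\mathbf{y}'\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}\mathbf{y} = (\mathbf{B}^{1/2}\mathbf{y})'\boldsymbol{\Sigma}^{-1}(\mathbf{B}^{1/2}\mathbf{y}) > 0$ if
$\mathbf{B}^{1/2}\mathbf{y} \ne \mathbf{0}$ or, equivalently, $\mathbf{y} \ne \mathbf{0}$. Thus, the eigenvalues $\eta_i$ of $\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}$ are positive
by Exercise 2.17. Result 4.9(b) then gives

$$\operatorname{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{B}) = \operatorname{tr}(\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}) = \sum_{i=1}^{p}\eta_i$$

and $|\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}| = \prod_{i=1}^{p}\eta_i$ by Exercise 2.12. From the properties of determinants in
Result 2A.11, we can write

$$|\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}| = |\mathbf{B}^{1/2}|\,|\boldsymbol{\Sigma}^{-1}|\,|\mathbf{B}^{1/2}| = |\boldsymbol{\Sigma}^{-1}|\,|\mathbf{B}^{1/2}|\,|\mathbf{B}^{1/2}| = |\boldsymbol{\Sigma}^{-1}|\,|\mathbf{B}| = \frac{1}{|\boldsymbol{\Sigma}|}|\mathbf{B}|$$

or

$$\frac{1}{|\boldsymbol{\Sigma}|} = \frac{|\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}|}{|\mathbf{B}|} = \frac{\prod_{i=1}^{p}\eta_i}{|\mathbf{B}|}$$

Combining the results for the trace and the determinant yields

$$\frac{1}{|\boldsymbol{\Sigma}|^{b}}\, e^{-\operatorname{tr}[\boldsymbol{\Sigma}^{-1}\mathbf{B}]/2} = \frac{\Bigl(\prod_{i=1}^{p}\eta_i\Bigr)^{b}}{|\mathbf{B}|^{b}}\, e^{-\sum_{i=1}^{p}\eta_i/2} = \frac{1}{|\mathbf{B}|^{b}}\prod_{i=1}^{p}\eta_i^{b}e^{-\eta_i/2}$$

But the function $\eta^{b}e^{-\eta/2}$ has a maximum, with respect to $\eta$, of $(2b)^{b}e^{-b}$, occurring at
$\eta = 2b$. The choice $\eta_i = 2b$, for each $i$, therefore gives

$$\frac{1}{|\boldsymbol{\Sigma}|^{b}}\, e^{-\operatorname{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{B})/2} \le \frac{1}{|\mathbf{B}|^{b}}(2b)^{pb}e^{-bp}$$

The upper bound is uniquely attained when $\boldsymbol{\Sigma} = (1/2b)\mathbf{B}$, since, for this choice,

$$\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2} = \mathbf{B}^{1/2}(2b)\mathbf{B}^{-1}\mathbf{B}^{1/2} = (2b)\underset{(p \times p)}{\mathbf{I}}$$

and

$$\operatorname{tr}[\boldsymbol{\Sigma}^{-1}\mathbf{B}] = \operatorname{tr}[\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}] = \operatorname{tr}[(2b)\mathbf{I}] = 2bp$$

Moreover,

$$\frac{1}{|\boldsymbol{\Sigma}|} = \frac{|\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}|}{|\mathbf{B}|} = \frac{|(2b)\mathbf{I}|}{|\mathbf{B}|} = \frac{(2b)^{p}}{|\mathbf{B}|}$$

Straightforward substitution for $\operatorname{tr}[\boldsymbol{\Sigma}^{-1}\mathbf{B}]$ and $1/|\boldsymbol{\Sigma}|^{b}$ yields the bound asserted. ■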
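
The inequality of Result 4.10 can be explored numerically in the scalar case $p = 1$, where $\boldsymbol{\Sigma}$ and $\mathbf{B}$ reduce to positive numbers. The following Python sketch is only an illustration under that simplifying assumption; the names `lhs` and `upper_bound` and the values of $\mathbf{B}$ and $b$ are made up.

```python
import math

def lhs(sigma, big_b, b):
    """Left side of Result 4.10 when p = 1: sigma^(-b) * exp(-B / (2 sigma))."""
    return sigma ** (-b) * math.exp(-big_b / (2.0 * sigma))

def upper_bound(big_b, b):
    """Right side of Result 4.10 when p = 1: B^(-b) * (2b)^b * exp(-b)."""
    return big_b ** (-b) * (2.0 * b) ** b * math.exp(-b)

big_b, b = 6.0, 1.5                      # made-up values with B > 0 and b > 0
sigma_star = big_b / (2.0 * b)           # the claimed maximizer, (1/2b) B
grid = [0.25 * k for k in range(1, 81)]  # trial values of sigma from 0.25 to 20

bound = upper_bound(big_b, b)
never_exceeds = all(lhs(s, big_b, b) <= bound + 1e-12 for s in grid)
print(never_exceeds, lhs(sigma_star, big_b, b), bound)
```

On this grid the left side never exceeds the bound and attains it at $\sigma = \mathbf{B}/2b$, which is the scalar version of the equality condition in the result.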


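The maximum likelihood estimates of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are those values, denoted by
$\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$, that maximize the function $L(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ in (4-16). The estimates $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ will
depend on the observed values $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ through the summary statistics $\bar{\mathbf{x}}$ and $\mathbf{S}$.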


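Result 4.11. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from a normal population
with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. Then

$$\hat{\boldsymbol{\mu}} = \bar{\mathbf{X}} \quad\text{and}\quad \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{j=1}^{n}(\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})' = \frac{(n - 1)}{n}\mathbf{S}$$

are the maximum likelihood estimators of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, respectively. Their observed
values, $\bar{\mathbf{x}}$ and $(1/n)\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'$, are called the maximum likelihood estimates
of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.


Proof. The exponent in the likelihood function [see Equation (4-16)], apart from
the multiplicative factor $-\tfrac{1}{2}$, is [see (4-17)]

$$\operatorname{tr}\Bigl[\boldsymbol{\Sigma}^{-1}\Bigl(\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\Bigr)\Bigr] + n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu})$$

By Result 4.1, $\boldsymbol{\Sigma}^{-1}$ is positive definite, so the distance $(\bar{\mathbf{x}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) > 0$ unless
$\boldsymbol{\mu} = \bar{\mathbf{x}}$. Thus, the likelihood is maximized with respect to $\boldsymbol{\mu}$ at $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$. It remains
to maximize

$$L(\hat{\boldsymbol{\mu}}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\boldsymbol{\Sigma}|^{n/2}}\, e^{-\operatorname{tr}\left[\boldsymbol{\Sigma}^{-1}\left(\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right)\right]/2}$$

over $\boldsymbol{\Sigma}$. By Result 4.10 with $b = n/2$ and $\mathbf{B} = \sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'$, the maximum
occurs at $\hat{\boldsymbol{\Sigma}} = (1/n)\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'$, as stated.

The maximum likelihood estimators are random quantities. They are obtained by
replacing the observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ in the expressions for $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ with the
corresponding random vectors, $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$. ■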


We note that the maximum likelihood estimator $\bar{\mathbf{X}}$ is a random vector and the
maximum likelihood estimator $\hat{\boldsymbol{\Sigma}}$ is a random matrix. The maximum likelihood
estimates are their particular values for the given data set. In addition, the maximum
of the likelihood is

$$L(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}) = \frac{1}{(2\pi)^{np/2}}\, e^{-np/2}\, \frac{1}{|\hat{\boldsymbol{\Sigma}}|^{n/2}} \tag{4-18}$$

or, since $|\hat{\boldsymbol{\Sigma}}| = [(n - 1)/n]^{p}|\mathbf{S}|$,

$$L(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}) = \text{constant} \times (\text{generalized variance})^{-n/2} \tag{4-19}$$

The generalized variance determines the “peakedness” of the likelihood function
and, consequently, is a natural measure of variability when the parent population is
multivariate normal.
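
The estimates of Result 4.11 are straightforward to compute. The sketch below uses plain Python lists and a tiny made-up data set of $n = 3$ bivariate observations; the function name `mle_normal` is a hypothetical choice, and the code simply contrasts the divisor $n$ of $\hat{\boldsymbol{\Sigma}}$ with the divisor $n - 1$ of $\mathbf{S}$.

```python
def mle_normal(data):
    """Return (x_bar, sigma_hat, s) for a list of observation vectors x_j."""
    n, p = len(data), len(data[0])
    xbar = [sum(row[k] for row in data) / n for k in range(p)]
    # Sum-of-squares-and-cross-products matrix about the sample mean.
    sscp = [[sum((row[i] - xbar[i]) * (row[k] - xbar[k]) for row in data)
             for k in range(p)] for i in range(p)]
    sigma_hat = [[v / n for v in r] for r in sscp]    # maximum likelihood estimate (divisor n)
    s = [[v / (n - 1) for v in r] for r in sscp]      # sample covariance matrix S (divisor n - 1)
    return xbar, sigma_hat, s

# Made-up observations, chosen only so the arithmetic is easy to follow.
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
xbar, sigma_hat, s = mle_normal(data)
print(xbar)
print(sigma_hat)
print(s)
```

Each entry of `sigma_hat` equals $(n - 1)/n = 2/3$ times the corresponding entry of `s`, in agreement with $\hat{\boldsymbol{\Sigma}} = ((n - 1)/n)\mathbf{S}$.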

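Maximum likelihood estimators possess an invariance property. Let $\hat{\boldsymbol{\theta}}$ be the
maximum likelihood estimator of $\boldsymbol{\theta}$, and consider estimating the parameter $h(\boldsymbol{\theta})$,
which is a function of $\boldsymbol{\theta}$. Then the maximum likelihood estimate of

$$\underset{(\text{a function of } \boldsymbol{\theta})}{h(\boldsymbol{\theta})} \quad\text{is given by}\quad \underset{(\text{same function of } \hat{\boldsymbol{\theta}})}{h(\hat{\boldsymbol{\theta}})} \tag{4-20}$$

(See [1] and [15].) For example,

1. The maximum likelihood estimator of $\boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$ is $\hat{\boldsymbol{\mu}}'\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}$, where $\hat{\boldsymbol{\mu}} = \bar{\mathbf{X}}$ and
$\hat{\boldsymbol{\Sigma}} = ((n - 1)/n)\mathbf{S}$ are the maximum likelihood estimators of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$,
respectively.
2. The maximum likelihood estimator of $\sqrt{\sigma_{ii}}$ is $\sqrt{\hat{\sigma}_{ii}}$, where

$$\hat{\sigma}_{ii} = \frac{1}{n}\sum_{j=1}^{n}(X_{ji} - \bar{X}_i)^2$$

is the maximum likelihood estimator of $\sigma_{ii} = \operatorname{Var}(X_i)$.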




Sufficient Statistics 


From expression (4-15), the joint density depends on the whole set of observations
$\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ only through the sample mean $\bar{\mathbf{x}}$ and the sum-of-squares-and-cross-products
matrix $\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' = (n - 1)\mathbf{S}$. We express this fact by saying
that $\bar{\mathbf{x}}$ and $(n - 1)\mathbf{S}$ (or $\mathbf{S}$) are sufficient statistics:


Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from a multivariate normal population
with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. Then

$$\bar{\mathbf{X}} \text{ and } \mathbf{S} \text{ are sufficient statistics} \tag{4-21}$$


The importance of sufficient statistics for normal populations is that all of the
information about $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ in the data matrix $\mathbf{X}$ is contained in $\bar{\mathbf{x}}$ and $\mathbf{S}$, regardless
of the sample size $n$. This generally is not true for nonnormal populations. Since
many multivariate techniques begin with sample means and covariances, it is prudent
to check on the adequacy of the multivariate normal assumption. (See Section
4.6.) If the data cannot be regarded as multivariate normal, techniques that depend
solely on $\bar{\mathbf{x}}$ and $\mathbf{S}$ may be ignoring other useful sample information.


4.4 The Sampling Distribution of $\bar{\mathbf{X}}$ and $\mathbf{S}$


The tentative assumption that $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ constitute a random sample from a
normal population with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ completely determines the
sampling distributions of $\bar{\mathbf{X}}$ and $\mathbf{S}$. Here we present the results on the sampling
distributions of $\bar{\mathbf{X}}$ and $\mathbf{S}$ by drawing a parallel with the familiar univariate
conclusions.

In the univariate case ($p = 1$), we know that $\bar{X}$ is normal with mean $\mu =$
(population mean) and variance

$$\frac{1}{n}\sigma^2 = \frac{\text{population variance}}{\text{sample size}}$$

The result for the multivariate case ($p \ge 2$) is analogous in that $\bar{\mathbf{X}}$ has a normal
distribution with mean $\boldsymbol{\mu}$ and covariance matrix $(1/n)\boldsymbol{\Sigma}$.

For the sample variance, recall that $(n - 1)s^2 = \sum_{j=1}^{n}(X_j - \bar{X})^2$ is distributed as
$\sigma^2$ times a chi-square variable having $n - 1$ degrees of freedom (d.f.). In turn, this
chi-square is the distribution of a sum of squares of independent standard normal
random variables. That is, $(n - 1)s^2$ is distributed as $\sigma^2(Z_1^2 + \cdots + Z_{n-1}^2) = (\sigma Z_1)^2
+ \cdots + (\sigma Z_{n-1})^2$. The individual terms $\sigma Z_i$ are independently distributed as
$N(0, \sigma^2)$. It is this latter form that is suitably generalized to the basic sampling
distribution for the sample covariance matrix.




The sampling distribution of the sample covariance matrix is called the Wishart
distribution, after its discoverer; it is defined as the sum of independent products of
multivariate normal random vectors. Specifically,

$$W_m(\cdot \mid \boldsymbol{\Sigma}) = \text{Wishart distribution with } m \text{ d.f.} = \text{distribution of } \sum_{j=1}^{m}\mathbf{Z}_j\mathbf{Z}_j' \tag{4-22}$$

where the $\mathbf{Z}_j$ are each independently distributed as $N_p(\mathbf{0}, \boldsymbol{\Sigma})$.
We summarize the sampling distribution results as follows: 


Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample of size $n$ from a $p$-variate normal
distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Then

1. $\bar{\mathbf{X}}$ is distributed as $N_p(\boldsymbol{\mu}, (1/n)\boldsymbol{\Sigma})$.
2. $(n - 1)\mathbf{S}$ is distributed as a Wishart random matrix with $n - 1$ d.f.   (4-23)
3. $\bar{\mathbf{X}}$ and $\mathbf{S}$ are independent.


Because $\boldsymbol{\Sigma}$ is unknown, the distribution of $\bar{\mathbf{X}}$ cannot be used directly to make
inferences about $\boldsymbol{\mu}$. However, $\mathbf{S}$ provides independent information about $\boldsymbol{\Sigma}$, and the
distribution of $\mathbf{S}$ does not depend on $\boldsymbol{\mu}$. This allows us to construct a statistic for
making inferences about $\boldsymbol{\mu}$, as we shall see in Chapter 5.

For the present, we record some further results from multivariable distribution
theory. The following properties of the Wishart distribution are derived directly
from its definition as a sum of the independent products, $\mathbf{Z}_j\mathbf{Z}_j'$. Proofs can be found
in [1].


Properties of the Wishart Distribution 


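1. If $\mathbf{A}_1$ is distributed as $W_{m_1}(\mathbf{A}_1 \mid \boldsymbol{\Sigma})$ independently of $\mathbf{A}_2$, which is distributed as
$W_{m_2}(\mathbf{A}_2 \mid \boldsymbol{\Sigma})$, then $\mathbf{A}_1 + \mathbf{A}_2$ is distributed as $W_{m_1 + m_2}(\mathbf{A}_1 + \mathbf{A}_2 \mid \boldsymbol{\Sigma})$. That is, the
degrees of freedom add.   (4-24)

2. If $\mathbf{A}$ is distributed as $W_m(\mathbf{A} \mid \boldsymbol{\Sigma})$, then $\mathbf{C}\mathbf{A}\mathbf{C}'$ is distributed as $W_m(\mathbf{C}\mathbf{A}\mathbf{C}' \mid \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}')$.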


Although we do not have any particular need for the probability density
function of the Wishart distribution, it may be of some interest to see its rather
complicated form. The density does not exist unless the sample size $n$ is greater
than the number of variables $p$. When it does exist, its value at the positive definite
matrix $\mathbf{A}$ is

$$w_{n-1}(\mathbf{A} \mid \boldsymbol{\Sigma}) = \frac{|\mathbf{A}|^{(n-p-2)/2}\, e^{-\operatorname{tr}[\mathbf{A}\boldsymbol{\Sigma}^{-1}]/2}}{2^{p(n-1)/2}\,\pi^{p(p-1)/4}\,|\boldsymbol{\Sigma}|^{(n-1)/2}\prod_{i=1}^{p}\Gamma\bigl(\tfrac{1}{2}(n - i)\bigr)}, \qquad \mathbf{A} \text{ positive definite} \tag{4-25}$$

where $\Gamma(\cdot)$ is the gamma function. (See [1] and [11].)




4.5 Large-Sample Behavior of $\bar{\mathbf{X}}$ and $\mathbf{S}$


Suppose the quantity $X$ is determined by a large number of independent causes
$V_1, V_2, \ldots, V_n$, where the random variables $V_i$ representing the causes have approximately
the same variability. If $X$ is the sum

$$X = V_1 + V_2 + \cdots + V_n$$

then the central limit theorem applies, and we conclude that $X$ has a distribution
that is nearly normal. This is true for virtually any parent distribution of the $V_i$'s, provided
that $n$ is large enough.

The univariate central limit theorem also tells us that the sampling distribution
of the sample mean, $\bar{X}$, for a large sample size is nearly normal, whatever the
form of the underlying population distribution. A similar result holds for many
other important univariate statistics.

It turns out that certain multivariate statistics, like $\bar{\mathbf{X}}$ and $\mathbf{S}$, have large-sample
properties analogous to their univariate counterparts. As the sample size is increased
without bound, certain regularities govern the sampling variation in $\bar{\mathbf{X}}$ and
$\mathbf{S}$, irrespective of the form of the parent population. Therefore, the conclusions presented
in this section do not require multivariate normal populations. The only
requirements are that the parent population, whatever its form, have a mean $\boldsymbol{\mu}$ and
a finite covariance $\boldsymbol{\Sigma}$.


Result 4.12 (Law of large numbers). Let $Y_1, Y_2, \ldots, Y_n$ be independent observations
from a population with mean $E(Y_i) = \mu$. Then

$$\bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n}$$

converges in probability to $\mu$ as $n$ increases without bound. That is, for any
prescribed accuracy $\varepsilon > 0$, $P[-\varepsilon < \bar{Y} - \mu < \varepsilon]$ approaches unity as $n \to \infty$.


Proof. See [9]. ■
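
A short simulation gives a feel for Result 4.12. The sketch below draws independent observations from an exponential population with mean 1, a clearly nonnormal parent chosen by us purely for illustration, and reports how far the sample mean is from $\mu$ for increasing $n$. The function name and the seed are arbitrary assumptions.

```python
import random

def sample_mean_error(n, seed=0):
    """|Y-bar - mu| for n simulated draws from an exponential population with mu = 1."""
    rng = random.Random(seed)        # fixed seed so the run is reproducible
    total = sum(rng.expovariate(1.0) for _ in range(n))
    return abs(total / n - 1.0)

for n in (10, 1000, 100000):
    print(n, sample_mean_error(n))
```

The errors are random and need not decrease at every step, but for large $n$ they are small with high probability, which is the content of convergence in probability.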


As a direct consequence of the law of large numbers, which says that each $\bar{X}_i$
converges in probability to $\mu_i$, $i = 1, 2, \ldots, p$,

$$\bar{\mathbf{X}} \text{ converges in probability to } \boldsymbol{\mu} \tag{4-26}$$

Also, each sample covariance $s_{ik}$ converges in probability to $\sigma_{ik}$, $i, k = 1, 2, \ldots, p$, and

$$\mathbf{S} \ (\text{or } \hat{\boldsymbol{\Sigma}} = \mathbf{S}_n) \text{ converges in probability to } \boldsymbol{\Sigma} \tag{4-27}$$

Statement (4-27) follows from writing

$$\begin{aligned}
(n - 1)s_{ik} &= \sum_{j=1}^{n}(X_{ji} - \bar{X}_i)(X_{jk} - \bar{X}_k) \\
&= \sum_{j=1}^{n}(X_{ji} - \mu_i + \mu_i - \bar{X}_i)(X_{jk} - \mu_k + \mu_k - \bar{X}_k) \\
&= \sum_{j=1}^{n}(X_{ji} - \mu_i)(X_{jk} - \mu_k) - n(\bar{X}_i - \mu_i)(\bar{X}_k - \mu_k)
\end{aligned}$$

Letting $Y_j = (X_{ji} - \mu_i)(X_{jk} - \mu_k)$, with $E(Y_j) = \sigma_{ik}$, we see that the first term in
$s_{ik}$ converges to $\sigma_{ik}$ and the second term converges to zero, by applying the law of
large numbers.

The practical interpretation of statements (4-26) and (4-27) is that, with high
probability, $\bar{\mathbf{X}}$ will be close to $\boldsymbol{\mu}$ and $\mathbf{S}$ will be close to $\boldsymbol{\Sigma}$ whenever the sample size is
large. The statement concerning $\bar{\mathbf{X}}$ is made even more precise by a multivariate
version of the central limit theorem.


Result 4.13 (The central limit theorem). Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be independent
observations from any population with mean $\boldsymbol{\mu}$ and finite covariance $\boldsymbol{\Sigma}$. Then

$$\sqrt{n}\,(\bar{\mathbf{X}} - \boldsymbol{\mu}) \text{ has an approximate } N_p(\mathbf{0}, \boldsymbol{\Sigma}) \text{ distribution}$$

for large sample sizes. Here $n$ should also be large relative to $p$.

Proof. See [1]. ■


The approximation provided by the central limit theorem applies to discrete,
as well as continuous, multivariate populations. Mathematically, the limit
is exact, and the approach to normality is often fairly rapid. Moreover, from the
results in Section 4.4, we know that $\bar{\mathbf{X}}$ is exactly normally distributed when the
underlying population is normal. Thus, we would expect the central limit theorem
approximation to be quite good for moderate $n$ when the parent population
is nearly normal.

As we have seen, when $n$ is large, $\mathbf{S}$ is close to $\boldsymbol{\Sigma}$ with high probability. Consequently,
replacing $\boldsymbol{\Sigma}$ by $\mathbf{S}$ in the approximating normal distribution for $\bar{\mathbf{X}}$ will have a
negligible effect on subsequent probability calculations.

Result 4.7 can be used to show that $n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$ has a $\chi_p^2$ distribution
when $\bar{\mathbf{X}}$ is distributed as $N_p\bigl(\boldsymbol{\mu}, \tfrac{1}{n}\boldsymbol{\Sigma}\bigr)$ or, equivalently, when $\sqrt{n}\,(\bar{\mathbf{X}} - \boldsymbol{\mu})$ has an
$N_p(\mathbf{0}, \boldsymbol{\Sigma})$ distribution. The $\chi_p^2$ distribution is approximately the sampling distribution
of $n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$ when $\bar{\mathbf{X}}$ is approximately normally distributed. Replacing
$\boldsymbol{\Sigma}^{-1}$ by $\mathbf{S}^{-1}$ does not seriously affect this approximation for $n$ large and much
greater than $p$.

We summarize the major conclusions of this section as follows:


Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be independent observations from a population with mean
$\boldsymbol{\mu}$ and finite (nonsingular) covariance $\boldsymbol{\Sigma}$. Then

$$\begin{aligned}
&\sqrt{n}\,(\bar{\mathbf{X}} - \boldsymbol{\mu}) \text{ is approximately } N_p(\mathbf{0}, \boldsymbol{\Sigma}) \\
&\text{and} \\
&n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}) \text{ is approximately } \chi_p^2
\end{aligned} \tag{4-28}$$

for $n - p$ large.

In the next three sections, we consider ways of verifying the assumption of normality
and methods for transforming nonnormal observations into observations
that are approximately normal.
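
The chi-square approximation in this section can be examined by simulation. The sketch below is a rough illustration under assumptions of our own: the population has $p = 2$ independent exponential components with mean 1, so that $\boldsymbol{\mu} = (1, 1)'$ and $\boldsymbol{\Sigma} = \mathbf{I}$, and the known $\boldsymbol{\Sigma}$ is used in place of $\mathbf{S}$ to keep the code short. For $p = 2$ the upper $(100\alpha)$th chi-square percentile is $-2\ln\alpha$.

```python
import math
import random

def coverage(n=200, reps=2000, alpha=0.05, seed=1):
    """Fraction of simulated samples with n (x-bar - mu)' Sigma^{-1} (x-bar - mu) <= chi2_2(alpha)."""
    rng = random.Random(seed)
    cutoff = -2.0 * math.log(alpha)     # upper 100*alpha percentile of chi-square, 2 d.f.
    hits = 0
    for _ in range(reps):
        # Two independent Exp(1) components: mu = (1, 1)' and Sigma = I.
        xbar1 = sum(rng.expovariate(1.0) for _ in range(n)) / n
        xbar2 = sum(rng.expovariate(1.0) for _ in range(n)) / n
        stat = n * ((xbar1 - 1.0) ** 2 + (xbar2 - 1.0) ** 2)
        hits += stat <= cutoff
    return hits / reps

print(coverage())
```

Even though the parent population is strongly skewed, the observed fraction should be close to $1 - \alpha = .95$ for a sample size of this order; smaller $n$ can be tried to see the approximation deteriorate.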




4.6 Assessing the Assumption of Normality 


As we have pointed out, most of the statistical techniques discussed in subsequent
chapters assume that each vector observation $\mathbf{X}_j$ comes from a multivariate normal
distribution. On the other hand, in situations where the sample size is large and the
techniques depend solely on the behavior of $\bar{\mathbf{X}}$, or distances involving $\bar{\mathbf{X}}$ of the form
$n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$, the assumption of normality for the individual observations
is less crucial. But to some degree, the quality of inferences made by these
methods depends on how closely the true parent population resembles the multivariate
normal form. It is imperative, then, that procedures exist for detecting cases
where the data exhibit moderate to extreme departures from what is expected
under multivariate normality.

We want to answer this question: Do the observations $\mathbf{X}_j$ appear to violate the
assumption that they came from a normal population? Based on the properties of
normal distributions, we know that all linear combinations of normal variables are
normal and the contours of the multivariate normal density are ellipsoids. Therefore,
we address these questions:


1. Do the marginal distributions of the elements of $\mathbf{X}$ appear to be normal? What
about a few linear combinations of the components $X_i$?


2. Do the scatter plots of pairs of observations on different characteristics give the 
elliptical appearance expected from normal populations? 


3. Are there any “wild” observations that should be checked for accuracy? 


It will become clear that our investigations of normality will concentrate on the 
behavior of the observations in one or two dimensions (for example, marginal dis- 
tributions and scatter plots). As might be expected, it has proved difficult to con- 
struct a “good” overall test of joint normality in more than two dimensions because 
of the large number of things that can go wrong. To some extent, we must pay a price 
for concentrating on univariate and bivariate examinations of normality: We can 
never be sure that we have not missed some feature that is revealed only in higher 
dimensions. (It is possible, for example, to construct a nonnormal bivariate distribution
with normal marginals. [See Exercise 4.8.]) Yet many types of nonnormality are
often reflected in the marginal distributions and scatter plots. Moreover, for most 
practical work, one-dimensional and two-dimensional investigations are ordinarily 
sufficient. Fortunately, pathological data sets that are normal in lower dimensional 
representations, but nonnormal in higher dimensions, are not frequently encoun- 
tered in practice. 


Evaluating the Normality of the Univariate Marginal Distributions 


Dot diagrams for smaller $n$ and histograms for $n > 25$ or so help reveal situations
where one tail of a univariate distribution is much longer than the other. If the histogram
for a variable $X_i$ appears reasonably symmetric, we can check further by
counting the number of observations in certain intervals. A univariate normal distribution
assigns probability .683 to the interval $(\mu_i - \sqrt{\sigma_{ii}},\ \mu_i + \sqrt{\sigma_{ii}})$ and probability
.954 to the interval $(\mu_i - 2\sqrt{\sigma_{ii}},\ \mu_i + 2\sqrt{\sigma_{ii}})$. Consequently, with a large
sample size $n$, we expect the observed proportion $\hat{p}_{i1}$ of the observations lying in the
interval $(\bar{x}_i - \sqrt{s_{ii}},\ \bar{x}_i + \sqrt{s_{ii}})$ to be about .683. Similarly, the observed proportion
$\hat{p}_{i2}$ of the observations in $(\bar{x}_i - 2\sqrt{s_{ii}},\ \bar{x}_i + 2\sqrt{s_{ii}})$ should be about .954. Using the
normal approximation to the sampling distribution of $\hat{p}_i$ (see [9]), we observe that
either

$$|\hat{p}_{i1} - .683| > 3\sqrt{\frac{(.683)(.317)}{n}} = \frac{1.396}{\sqrt{n}}$$

or

$$|\hat{p}_{i2} - .954| > 3\sqrt{\frac{(.954)(.046)}{n}} = \frac{.628}{\sqrt{n}} \tag{4-29}$$

would indicate departures from an assumed normal distribution for the $i$th characteristic.
When the observed proportions are too small, parent distributions with
thicker tails than the normal are suggested.
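
The interval-counting check in (4-29) is simple to automate. The following Python sketch uses only the standard library; the function name `proportion_check` and the artificial data (standard normal quantiles, which should not be flagged) are assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def proportion_check(values):
    """Compare observed proportions within 1 and 2 standard deviations with .683 and .954."""
    n = len(values)
    center, spread = mean(values), stdev(values)   # x-bar_i and sqrt(s_ii), divisor n - 1
    p1 = sum(abs(v - center) < spread for v in values) / n
    p2 = sum(abs(v - center) < 2 * spread for v in values) / n
    limit1 = 3 * math.sqrt(0.683 * 0.317 / n)      # equals 1.396 / sqrt(n)
    limit2 = 3 * math.sqrt(0.954 * 0.046 / n)      # equals 0.628 / sqrt(n)
    return {"p1": p1, "p2": p2,
            "flag1": abs(p1 - 0.683) > limit1,
            "flag2": abs(p2 - 0.954) > limit2}

# Artificial, "ideal" data: standard normal quantiles at levels (j - 1/2)/n.
n = 100
values = [NormalDist().inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]
result = proportion_check(values)
print(result)
```

A `True` flag signals a departure from normality for that characteristic in the sense of (4-29); for the artificial data above neither flag is raised.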

Plots are always useful devices in any data analysis. Special plots called Q-Q 
plots can be used to assess the assumption of normality. These plots can be made for 
the marginal distributions of the sample observations on each variable. They are, in 
effect, plots of the sample quantile versus the quantile one would expect to observe if 
the observations actually were normally distributed. When the points lie very nearly 
along a straight line, the normality assumption remains tenable. Normality is suspect
if the points deviate from a straight line. Moreover, the pattern of the deviations can 
provide clues about the nature of the nonnormality. Once the reasons for the non- 
normality are identified, corrective action is often possible. (See Section 4.8.) 

To simplify notation, let $x_1, x_2, \ldots, x_n$ represent n observations on any single
characteristic $X_i$. Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ represent these
observations after they are ordered according to magnitude. For example, $x_{(2)}$ is the
second smallest observation and $x_{(n)}$ is the largest observation. The $x_{(j)}$'s are
the sample quantiles. When the $x_{(j)}$ are distinct, exactly j observations are less
than or equal to $x_{(j)}$. (This is theoretically always true when the observations are
of the continuous type, which we usually assume.) The proportion j/n of the sample at or
to the left of $x_{(j)}$ is often approximated by $(j - \frac{1}{2})/n$ for analytical
convenience.¹

For a standard normal distribution, the quantiles $q_{(j)}$ are defined by the relation

$$P[Z \le q_{(j)}] = \int_{-\infty}^{q_{(j)}} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = p_{(j)} = \frac{j - \frac{1}{2}}{n} \qquad (4\text{-}30)$$

(See Table 1 in the appendix.) Here $p_{(j)}$ is the probability of getting a value less
than or equal to $q_{(j)}$ in a single drawing from a standard normal population.

The idea is to look at the pairs of quantiles $(q_{(j)}, x_{(j)})$ with the same associated
cumulative probability $(j - \frac{1}{2})/n$. If the data arise from a normal population,
the pairs $(q_{(j)}, x_{(j)})$ will be approximately linearly related, since
$\sigma q_{(j)} + \mu$ is nearly the expected sample quantile.²


¹ The $\frac{1}{2}$ in the numerator of $(j - \frac{1}{2})/n$ is a "continuity" correction.
Some authors (see [5] and [10]) have suggested replacing $(j - \frac{1}{2})/n$ by
$(j - \frac{3}{8})/(n + \frac{1}{4})$.

² A better procedure is to plot $(m_{(j)}, x_{(j)})$, where $m_{(j)} = E(z_{(j)})$ is the
expected value of the jth-order statistic in a sample of size n from a standard normal
distribution. (See [13] for further discussion.)


Example 4.9 (Constructing a Q-Q plot) A sample of n = 10 observations gives the
values in the following table:

Ordered observations    Probability levels      Standard normal quantiles
$x_{(j)}$               $(j - \frac{1}{2})/n$   $q_{(j)}$

  -1.00                 .05                     -1.645
   -.10                 .15                     -1.036
    .16                 .25                      -.674
    .41                 .35                      -.385
    .62                 .45                      -.125
    .80                 .55                       .125
   1.26                 .65                       .385
   1.54                 .75                       .674
   1.71                 .85                      1.036
   2.30                 .95                      1.645

Here, for example,
$P[Z \le .385] = \int_{-\infty}^{.385} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = .65$. [See (4-30).]

Let us now construct the Q-Q plot and comment on its appearance. The Q-Q
plot for the foregoing data, which is a plot of the ordered data $x_{(j)}$ against the
normal quantiles $q_{(j)}$, is shown in Figure 4.5. The pairs of points $(q_{(j)}, x_{(j)})$
lie very nearly along a straight line, and we would not reject the notion that these data
are normally distributed, particularly with a sample size as small as n = 10.

Figure 4.5 A Q-Q plot for the data in Example 4.9. (Horizontal axis: $q_{(j)}$; vertical
axis: $x_{(j)}$.)


The calculations required for Q-Q plots are easily programmed for electronic 
computers. Many statistical programs available commercially are capable of produc- 
ing such plots. 

The steps leading to a Q-Q plot are as follows:

1. Order the original observations to get $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ and their
corresponding probability values $(1 - \frac{1}{2})/n, (2 - \frac{1}{2})/n, \ldots, (n - \frac{1}{2})/n$;

2. Calculate the standard normal quantiles $q_{(1)}, q_{(2)}, \ldots, q_{(n)}$; and

3. Plot the pairs of observations $(q_{(1)}, x_{(1)}), (q_{(2)}, x_{(2)}), \ldots, (q_{(n)}, x_{(n)})$,
and examine the "straightness" of the outcome.
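The three steps can be sketched with the Python standard library. The helper name `qq_pairs` is an assumption made for illustration; the data are the ordered observations of Example 4.9, with the sixth value taken as .80, the reading that agrees with the mean of .770 reported in Example 4.11.

```python
from statistics import NormalDist

def qq_pairs(values):
    """Return (q_(j), x_(j)) pairs using probability levels (j - 1/2)/n."""
    ordered = sorted(values)                        # step 1: order the data
    n = len(ordered)
    std_normal = NormalDist()                       # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((j - 0.5) / n), x)  # step 2: normal quantiles
            for j, x in enumerate(ordered, start=1)]

# Ordered observations of Example 4.9
data = [-1.00, -0.10, 0.16, 0.41, 0.62, 0.80, 1.26, 1.54, 1.71, 2.30]
pairs = qq_pairs(data)

# Step 3 would plot these pairs; here they are simply listed.
for q, x in pairs:
    print(f"{q:7.3f}  {x:5.2f}")
```

The printed quantiles should agree with the third column of the table in Example 4.9 to about three decimals.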


Q-Q plots are not particularly informative unless the sample size is moderate to
large, for instance, n ≥ 20. There can be quite a bit of variability in the straightness
of the Q-Q plot for small samples, even when the observations are known to come
from a normal population.


Example 4.10 (A Q-Q plot for radiation data) The quality-control department of a
manufacturer of microwave ovens is required by the federal government to monitor
the amount of radiation emitted when the doors of the ovens are closed. Observations
of the radiation emitted through closed doors of n = 42 randomly selected
ovens were made. The data are listed in Table 4.1.


Table 4.1 Radiation Data (Door Closed)

Oven no.   Radiation     Oven no.   Radiation     Oven no.   Radiation
 1         .15           16         .10
 2         .09           17
 3         .18           18
 4         .10           19
 5         .05           20
 6         .12           21
 7         .08           22
 8         .05           23
 9         .08           24
10         .10           25
11         .07           26
12         .02           27
13         .01           28
14         .10           29
15         .10           30

Source: Data courtesy of J. D. Cryer.


In order to determine the probability of exceeding a prespecified tolerance 
level, a probability distribution for the radiation emitted was needed. Can we regard 
the observations here as being normally distributed? 

A computer was used to assemble the pairs $(q_{(j)}, x_{(j)})$ and construct the Q-Q
plot, pictured in Figure 4.6. It appears from the plot that the data as a whole are not
normally distributed. The points indicated by the circled locations in the figure are
outliers, that is, values that are too large relative to the rest of the observations.

For the radiation data, several observations are equal. When this occurs, those
observations with like values are associated with the same normal quantile. This
quantile is calculated using the average of the quantiles the tied observations would
have if they all differed slightly. ■
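The treatment of ties can be sketched as follows. The function name and the four readings are hypothetical; the only rule taken from the text is that tied observations share the average of the quantiles they would have received had they differed slightly.

```python
from statistics import NormalDist

def qq_pairs_with_ties(values):
    """Q-Q pairs in which tied observations share one averaged normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    inv = NormalDist().inv_cdf
    raw = [inv((j - 0.5) / n) for j in range(1, n + 1)]   # quantiles ignoring ties
    pairs = []
    start = 0
    while start < n:
        end = start
        # extend the run while the next ordered value equals the current one
        while end + 1 < n and ordered[end + 1] == ordered[start]:
            end += 1
        shared = sum(raw[start:end + 1]) / (end - start + 1)
        pairs.extend((shared, ordered[start]) for _ in range(start, end + 1))
        start = end + 1
    return pairs

# Hypothetical readings with one tie at .10
pairs = qq_pairs_with_ties([0.10, 0.05, 0.10, 0.15])
print(pairs)
```

Here the two readings of .10 would have received quantiles of equal size and opposite sign, so their shared quantile is essentially zero.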


Figure 4.6 A Q-Q plot of the radiation data (door closed) from Example 4.10.
(Horizontal axis: $q_{(j)}$; vertical axis: $x_{(j)}$. The integers in the plot indicate
the number of points occupying the same location.)


The straightness of the Q-Q plot can be measured by calculating the correlation
coefficient of the points in the plot. The correlation coefficient for the Q-Q plot is
defined by

$$r_Q = \frac{\sum_{j=1}^{n} (x_{(j)} - \bar{x})(q_{(j)} - \bar{q})}{\sqrt{\sum_{j=1}^{n} (x_{(j)} - \bar{x})^2}\ \sqrt{\sum_{j=1}^{n} (q_{(j)} - \bar{q})^2}} \qquad (4\text{-}31)$$

and a powerful test of normality can be based on it. (See [5], [10], and [12].) Formally,
we reject the hypothesis of normality at level of significance $\alpha$ if $r_Q$ falls below
the appropriate value in Table 4.2.


Table 4.2 Critical Points for the Q-Q Plot Correlation Coefficient Test for Normality

Sample size      Significance levels α
n                .01       .05       .10

  5              .8299     .8788     .9032
 10              .8801     .9198     .9351
 15              .9126     .9389     .9503
 20              .9269     .9508     .9604
 25              .9410     .9591     .9665
 30              .9479     .9652     .9715
 35              .9538     .9682     .9740
 40              .9599     .9726     .9771
 45              .9632     .9749     .9792
 50              .9671     .9768     .9809
 55              .9695     .9787     .9822
 60              .9720     .9801     .9836
 75              .9771     .9838     .9866
100              .9822     .9873     .9895
150              .9879     .9913     .9928
200              .9905     .9931     .9942
300              .9935     .9953     .9960


Example 4.11 (A correlation coefficient test for normality) Let us calculate the
correlation coefficient $r_Q$ from the Q-Q plot of Example 4.9 (see Figure 4.5) and test
for normality.

Using the information from Example 4.9, we have $\bar{x} = .770$ and

$$\sum_{j=1}^{10} (x_{(j)} - \bar{x})\, q_{(j)} = 8.584, \qquad \sum_{j=1}^{10} (x_{(j)} - \bar{x})^2 = 8.472, \qquad \text{and} \qquad \sum_{j=1}^{10} q_{(j)}^2 = 8.795$$

Since always, $\bar{q} = 0$,

$$r_Q = \frac{8.584}{\sqrt{8.472}\ \sqrt{8.795}} = .994$$

A test of normality at the 10% level of significance is provided by referring $r_Q = .994$
to the entry in Table 4.2 corresponding to n = 10 and $\alpha = .10$. This entry is .9351.
Since $r_Q > .9351$, we do not reject the hypothesis of normality. ■
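The calculation in Example 4.11 can be reproduced with a short sketch. The function name `qq_correlation` is an assumption; the data are those of Example 4.9 (sixth value read as .80, which matches the stated mean of .770), and the critical value .9351 is the Table 4.2 entry for n = 10 and α = .10.

```python
import math
from statistics import NormalDist

def qq_correlation(values):
    """Correlation coefficient r_Q of the Q-Q plot, as defined in (4-31)."""
    ordered = sorted(values)
    n = len(ordered)
    inv = NormalDist().inv_cdf
    q = [inv((j - 0.5) / n) for j in range(1, n + 1)]
    xbar = sum(ordered) / n
    qbar = sum(q) / n                     # zero up to rounding, by symmetry
    sxq = sum((x - xbar) * (qj - qbar) for x, qj in zip(ordered, q))
    sxx = sum((x - xbar) ** 2 for x in ordered)
    sqq = sum((qj - qbar) ** 2 for qj in q)
    return sxq / math.sqrt(sxx * sqq)

data = [-1.00, -0.10, 0.16, 0.41, 0.62, 0.80, 1.26, 1.54, 1.71, 2.30]
r_q = qq_correlation(data)
critical = 0.9351                         # Table 4.2, n = 10, alpha = .10
print(round(r_q, 3), "reject normality" if r_q < critical else "do not reject")
```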


Instead of $r_Q$, some software packages evaluate the original statistic proposed
by Shapiro and Wilk [12]. Its correlation form corresponds to replacing $q_{(j)}$ by a
function of the expected value of standard normal-order statistics and their
covariances. We prefer $r_Q$ because it corresponds directly to the points in the
normal-scores plot. For large sample sizes, the two statistics are nearly the same
(see [13]), so either can be used to judge lack of fit.

Linear combinations of more than one characteristic can be investigated. Many
statisticians suggest plotting

$$\hat{\mathbf{e}}_1'\mathbf{x}_j \qquad \text{where} \qquad \mathbf{S}\hat{\mathbf{e}}_1 = \hat{\lambda}_1 \hat{\mathbf{e}}_1$$

in which $\hat{\lambda}_1$ is the largest eigenvalue of $\mathbf{S}$. Here
$\mathbf{x}_j' = [x_{j1}, x_{j2}, \ldots, x_{jp}]$ is the jth observation on the p variables
$X_1, X_2, \ldots, X_p$. The linear combination $\hat{\mathbf{e}}_p'\mathbf{x}_j$
corresponding to the smallest eigenvalue is also frequently singled out for inspection.
(See Chapter 8 and [6] for further details.)


Evaluating Bivariate Normality 


We would like to check on the assumption of normality for all distributions of 
2, 3,..., p dimensions. However, as we have pointed out, for practical work it is usu- 
ally sufficient to investigate the univariate and bivariate distributions. We consid- 
ered univariate marginal distributions earlier. It is now of interest to examine the 
bivariate case. 

In Chapter 1, we described scatter plots for pairs of characteristics. If the obser- 
vations were generated from a multivariate normal distribution, each bivariate dis- 
tribution would be normal, and the contours of constant density would be ellipses. 
The scatter plot should conform to this structure by exhibiting an overall pattern 
that is nearly elliptical. 

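Moreover, by Result 4.7, the set of bivariate outcomes $\mathbf{x}$ such that

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) \le \chi_2^2(.5)$$

has probability .5. Thus, we should expect roughly the same percentage, 50%, of
sample observations to lie in the ellipse given by

$$\{\text{all } \mathbf{x} \text{ such that } (\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le \chi_2^2(.5)\}$$

where we have replaced $\boldsymbol{\mu}$ by its estimate $\bar{\mathbf{x}}$ and
$\boldsymbol{\Sigma}^{-1}$ by its estimate $\mathbf{S}^{-1}$. If not, the normality
assumption is suspect.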


Example 4.12 (Checking bivariate normality) Although not a random sample, data
consisting of the pairs of observations ($x_1$ = sales, $x_2$ = profits) for the 10 largest
companies in the world are listed in Exercise 1.4. These data give

$$\bar{\mathbf{x}} = \begin{bmatrix} 155.60 \\ 14.70 \end{bmatrix}, \qquad \mathbf{S} = \begin{bmatrix} 7476.45 & 303.62 \\ 303.62 & 26.19 \end{bmatrix}$$

so

$$\mathbf{S}^{-1} = \frac{1}{103{,}623.12} \begin{bmatrix} 26.19 & -303.62 \\ -303.62 & 7476.45 \end{bmatrix} = \begin{bmatrix} .000253 & -.002930 \\ -.002930 & .072148 \end{bmatrix}$$

From Table 3 in the appendix, $\chi_2^2(.5) = 1.39$. Thus, any observation
$\mathbf{x}' = [x_1, x_2]$ satisfying

$$\begin{bmatrix} x_1 - 155.60 \\ x_2 - 14.70 \end{bmatrix}' \begin{bmatrix} .000253 & -.002930 \\ -.002930 & .072148 \end{bmatrix} \begin{bmatrix} x_1 - 155.60 \\ x_2 - 14.70 \end{bmatrix} \le 1.39$$

is on or inside the estimated 50% contour. Otherwise the observation is outside this
contour. The first pair of observations in Exercise 1.4 is $[x_1, x_2]' = [108.28, 17.05]$.
In this case

$$\begin{bmatrix} 108.28 - 155.60 \\ 17.05 - 14.70 \end{bmatrix}' \begin{bmatrix} .000253 & -.002930 \\ -.002930 & .072148 \end{bmatrix} \begin{bmatrix} 108.28 - 155.60 \\ 17.05 - 14.70 \end{bmatrix} = 1.61 > 1.39$$

and this point falls outside the 50% contour. The remaining nine points have
generalized distances from $\bar{\mathbf{x}}$ of .30, .62, 1.79, 1.30, 4.38, 1.64, 3.53,
1.71, and 1.16, respectively. Since four of these distances are less than 1.39, a
proportion, .40, of the data falls within the 50% contour. If the observations were
normally distributed, we would expect about half, or 5, of them to be within this contour.
This difference in proportions might ordinarily provide evidence for rejecting the notion
of bivariate normality; however, our sample size of 10 is too small to reach this
conclusion. (See also Example 4.13.) ■
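The arithmetic of Example 4.12 can be checked with a minimal sketch. It uses the summary statistics quoted in the example and relies on the fact that, for 2 degrees of freedom, the chi-square quantile has the closed form −2 ln(1 − prob); the variable and function names are illustrative assumptions.

```python
import math

# Summary statistics quoted in Example 4.12
xbar = (155.60, 14.70)
S = ((7476.45, 303.62),
     (303.62, 26.19))

# Inverse of the 2 x 2 matrix S
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
S_inv = ((S[1][1] / det, -S[0][1] / det),
         (-S[1][0] / det, S[0][0] / det))

def squared_distance(x):
    """Squared generalized distance (x - xbar)' S^{-1} (x - xbar)."""
    d0, d1 = x[0] - xbar[0], x[1] - xbar[1]
    return (S_inv[0][0] * d0 * d0
            + 2 * S_inv[0][1] * d0 * d1
            + S_inv[1][1] * d1 * d1)

# 50% point of chi-square with 2 d.f.: solve 1 - exp(-c/2) = .5 for c
chi2_median = -2.0 * math.log(0.5)

d2_first = squared_distance((108.28, 17.05))

# The ten squared distances reported in the example
distances = [1.61, 0.30, 0.62, 1.79, 1.30, 4.38, 1.64, 3.53, 1.71, 1.16]
inside = sum(d <= chi2_median for d in distances) / len(distances)
print(round(det, 2), round(d2_first, 2), round(chi2_median, 2), inside)
```

The first distance comes out near 1.62 when the unrounded inverse is used, in line with the 1.61 reported in the example.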


Computing the fraction of the points within a contour and subjectively comparing
it with the theoretical probability is a useful, but rather rough, procedure.

A somewhat more formal method for judging the joint normality of a data set is
based on the squared generalized distances

$$d_j^2 = (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}}), \qquad j = 1, 2, \ldots, n \qquad (4\text{-}32)$$

where $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ are the sample observations. The
procedure we are about to describe is not limited to the bivariate case; it can be used
for all p ≥ 2.

When the parent population is multivariate normal and both n and n − p are
greater than 25 or 30, each of the squared distances $d_1^2, d_2^2, \ldots, d_n^2$ should
behave like a chi-square random variable. [See Result 4.7 and Equations (4-26) and (4-27).]
Although these distances are not independent or exactly chi-square distributed, it is
helpful to plot them as if they were. The resulting plot is called a chi-square plot or
gamma plot, because the chi-square distribution is a special case of the more general
gamma distribution. (See [6].)

To construct the chi-square plot,

1. Order the squared distances in (4-32) from smallest to largest as
$d_{(1)}^2 \le d_{(2)}^2 \le \cdots \le d_{(n)}^2$.

2. Graph the pairs $(q_{c,p}((j - \frac{1}{2})/n),\ d_{(j)}^2)$, where
$q_{c,p}((j - \frac{1}{2})/n)$ is the $100(j - \frac{1}{2})/n$ quantile of the chi-square
distribution with p degrees of freedom.

Quantiles are specified in terms of proportions, whereas percentiles are specified
in terms of percentages.

The quantiles $q_{c,p}((j - \frac{1}{2})/n)$ are related to the upper percentiles of a
chi-squared distribution. In particular,
$q_{c,p}((j - \frac{1}{2})/n) = \chi_p^2((n - j + \frac{1}{2})/n)$.

The plot should resemble a straight line through the origin having slope 1. A
systematic curved pattern suggests lack of normality. One or two points far above
the line indicate large distances, or outlying observations, that merit further
attention.


Example 4.13 (Constructing a chi-square plot) Let us construct a chi-square plot of
the generalized distances given in Example 4.12. The ordered distances and the
corresponding chi-square percentiles for p = 2 and n = 10 are listed in the following
table:

 j     $d_{(j)}^2$     $q_{c,2}((j - \frac{1}{2})/10)$
 1       .30             .10
 2       .62             .33
 3      1.16             .58
 4      1.30             .86
 5      1.61            1.20
 6      1.64            1.60
 7      1.71            2.10
 8      1.79            2.77
 9      3.53            3.79
10      4.38            5.99


Figure 4.7 A chi-square plot of the ordered distances in Example 4.13. (Horizontal axis:
$q_{c,2}((j - \frac{1}{2})/10)$; vertical axis: $d_{(j)}^2$.)

A graph of the pairs $(q_{c,2}((j - \frac{1}{2})/10),\ d_{(j)}^2)$ is shown in Figure 4.7.
The points in Figure 4.7 are reasonably straight. Given the small sample size it is
difficult to reject bivariate normality on the evidence in this graph. If further analysis
of the data were required, it might be reasonable to transform them to observations
more nearly bivariate normal. Appropriate transformations are discussed in
Section 4.8. ■
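The coordinates of the chi-square plot in Example 4.13 can be generated with a brief sketch. It is restricted to p = 2, where the chi-square quantile has the closed form −2 ln(1 − prob); for other p a numerical quantile routine (for example, one from a statistics library) would be needed. The function name is an illustrative assumption.

```python
import math

def chi2_quantile_2df(prob):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - prob)

# Squared generalized distances from Example 4.12
d2 = [1.61, 0.30, 0.62, 1.79, 1.30, 4.38, 1.64, 3.53, 1.71, 1.16]

ordered = sorted(d2)                                   # step 1: order the distances
n = len(ordered)
pairs = [(chi2_quantile_2df((j - 0.5) / n), d)         # step 2: pair with quantiles
         for j, d in enumerate(ordered, start=1)]

for q, d in pairs:
    print(f"{q:5.2f}  {d:5.2f}")
```

The first column of the output should match the quantile column of the table in Example 4.13.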


In addition to inspecting univariate plots and scatter plots, we should check
multivariate normality by constructing a chi-squared or $d^2$ plot. Figure 4.8 contains
plots based on two computer-generated samples of 30 four-variate normal random
vectors. As expected, the plots have a straight-line pattern, but the top two or three
ordered squared distances are quite variable.

Figure 4.8 Chi-square plots for two simulated four-variate normal data sets with n = 30.
(Horizontal axis: $q_{c,4}((j - \frac{1}{2})/30)$; vertical axis: $d_{(j)}^2$.)

The next example contains a real data set comparable to the simulated data set
that produced the plots in Figure 4.8.

Example 4.14 (Evaluating multivariate normality for a four-variable data set) The
data in Table 4.3 were obtained by taking four different measures of stiffness,
$x_1, x_2, x_3$, and $x_4$, of each of n = 30 boards. The first measurement involves sending
a shock wave down the board, the second measurement is determined while vibrating
the board, and the last two measurements are obtained from static tests. The
squared distances $d_j^2 = (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}})$
are also presented in the table.

Table 4.3 Four Measurements of Stiffness

Observation no.    $x_1$    $x_2$    $x_3$    $x_4$    $d^2$
 1                 1889     1651     1561     1778       .60
 2                 2403     2048     2087     2197      5.48
 3                 2119     1700     1815     2222      7.62
 4                 1645     1627     1110     1533      5.21
 5                 1976     1916     1614     1883      1.40
 6                 1712     1712     1439     1546      2.22
 7                 1943     1685     1271     1671      4.99
 8                 2104     1820     1717     1874      1.49
 9                 2983     2794     2412     2581     12.26
10                 1745     1600     1384     1508       .77
11                 1710     1591     1518     1667      1.93
12                 2046     1907     1627     1898       .46
13                 1840     1841     1595     1741      2.70
14                 1867     1685     1493     1678       .13
15                 1859     1649     1389     1714      1.08
16                 1954     2149     1180     1281     16.85
17                 1325     1170     1002     1176      3.50
18                 1419     1371     1252     1308      3.99
19                 1828     1634     1602     1755      1.36
20                 1725     1594     1313     1646      1.46
21                 2276     2189     1547     2111      9.90
22                 1899     1614     1422     1477      5.06
23                 1633     1513     1290     1516
24                 2061     1867     1646     2037      2.54
25                 1856     1493     1356     1533      4.58
26                 1727     1412     1238     1469      3.40
27                 2168     1896     1701     1834      2.38
28                 1655     1675     1414     1597      3.00
29                 2326     2301     2065     2234      6.28
30                 1490     1382     1214     1284      2.58

The marginal distributions appear quite normal (see Exercise 4.33), with the
possible exception of specimen (board) 9.

To further evaluate multivariate normality, we constructed the chi-square plot
shown in Figure 4.9. The two specimens with the largest squared distances are clearly
removed from the straight-line pattern. Together, with the next largest point or
two, they make the plot appear curved at the upper end. We will return to a discussion
of this plot in Example 4.15.
Figure 4.9 A chi-square plot for the data in Example 4.14. (Horizontal axis:
$q_{c,4}((j - \frac{1}{2})/30)$; vertical axis: $d_{(j)}^2$.)

We have discussed some rather simple techniques for checking the multivariate
normality assumption. Specifically, we advocate calculating the $d_j^2$, j = 1, 2, ..., n
[see Equation (4-32)] and comparing the results with $\chi^2$ quantiles. For example,
p-variate normality is indicated if

1. Roughly half of the $d_j^2$ are less than or equal to $q_{c,p}(.50)$.

2. A plot of the ordered squared distances $d_{(1)}^2 \le d_{(2)}^2 \le \cdots \le d_{(n)}^2$
versus $q_{c,p}((1 - \frac{1}{2})/n),\ q_{c,p}((2 - \frac{1}{2})/n),\ \ldots,\ q_{c,p}((n - \frac{1}{2})/n)$,
respectively, is nearly a straight line having slope 1 and that passes through the origin.

(See [6] for a more complete exposition of methods for assessing normality.)

We close this section by noting that all measures of goodness of fit suffer the same
serious drawback. When the sample size is small, only the most aberrant behavior will
be identified as lack of fit. On the other hand, very large samples invariably produce
statistically significant lack of fit. Yet the departure from the specified distribution
may be very small and technically unimportant to the inferential conclusions.


4.7 Detecting Outliers and Cleaning Data 


Most data sets contain one or a few unusual observations that do not seem to be- 
long to the pattern of variability produced by the other observations. With data 
on a single characteristic, unusual observations are those that are either very 
large or very small relative to the others. The situation can be more complicated 
with multivariate data. Before we address the issue of identifying these outliers, 
we must emphasize that not all outliers are wrong numbers. They may, justifiably, 
be part of the group and may lead to a better understanding of the phenomena 
being studied. 


Outliers are best detected visually whenever this is possible. When the number
of observations n is large, dot plots are not feasible. When the number of
characteristics p is large, the large number of scatter plots p(p − 1)/2 may prevent
viewing them all. Even so, we suggest first visually inspecting the data whenever possible.

What should we look for? For a single random variable, the problem is one
dimensional, and we look for observations that are far from the others. For instance,
the dot diagram

[Dot diagram of the observations along the x axis, with one point at the far right circled.]

reveals a single large observation which is circled.

In the bivariate case, the situation is more complicated. Figure 4.10 shows a
situation with two unusual observations.

The data point circled in the upper right corner of the figure is detached
from the pattern, and its second coordinate is large relative to the rest of the $x_2$
measurements, as shown by the vertical dot diagram. The second outlier, also circled,
is far from the elliptical pattern of the rest of the points, but, separately, each of
its components has a typical value. This outlier cannot be detected by inspecting the
marginal dot diagrams.

Figure 4.10 Two outliers; one univariate and one bivariate.

In higher dimensions, there can be outliers that cannot be detected from the
univariate plots or even the bivariate scatter plots. Here a large value of
$(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}})$ will
suggest an unusual observation, even though it cannot be seen visually.


Steps for Detecting Outliers 


1. Make a dot plot for each variable.

2. Make a scatter plot for each pair of variables.

3. Calculate the standardized values $z_{jk} = (x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$ for
j = 1, 2, ..., n and each column k = 1, 2, ..., p. Examine these standardized values for
large or small values.

4. Calculate the generalized squared distances
$(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}})$. Examine
these distances for unusually large values. In a chi-square plot, these would be
the points farthest from the origin.


In step 3, “large” must be interpreted relative to the sample size and number of 
variables. There are n X p standardized values. When n = 100 and p = 5, there are 
500 values. You expect 1 or 2 of these to exceed 3 or be less than —3, even if the data 
came from a multivariate distribution that is exactly normal. As a guideline, 3.5 
might be considered large for moderate sample sizes. 

In step 4, "large" is measured by an appropriate percentile of the chi-square
distribution with p degrees of freedom. If the sample size is n = 100, we would expect
5 observations to have values of $d_j^2$ that exceed the upper fifth percentile of the
chi-square distribution. A more extreme percentile must serve to determine
observations that do not fit the pattern of the remaining data.
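Step 3 and the accompanying rule of thumb can be sketched as follows. The values are the $x_1$ column of Table 4.3 (observations 1 to 30, in order); the variable names and the cutoff of 3 used for flagging are illustrative assumptions rather than a prescribed procedure.

```python
from statistics import NormalDist, mean, stdev

# x1 column of Table 4.3, observations 1..30
x1 = [1889, 2403, 2119, 1645, 1976, 1712, 1943, 2104, 2983, 1745,
      1710, 2046, 1840, 1867, 1859, 1954, 1325, 1419, 1828, 1725,
      2276, 1899, 1633, 2061, 1856, 1727, 2168, 1655, 2326, 1490]

xbar = mean(x1)
s = stdev(x1)                              # sqrt(s_kk), divisor n - 1
z = [(v - xbar) / s for v in x1]           # standardized values z_jk

# Observation numbers (1-based) whose standardized value exceeds 3 in size
flagged = [j + 1 for j, zj in enumerate(z) if abs(zj) > 3.0]

# Expected number of |z| > 3 among n * p = 100 * 5 = 500 exactly normal values
expected_beyond_3 = 500 * 2 * (1 - NormalDist().cdf(3.0))

print(round(xbar, 1), flagged, round(expected_beyond_3, 2))
```

Only specimen 9 is flagged on this variable, and the expected count of about 1.35 agrees with the "1 or 2" quoted for the n = 100, p = 5 case.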

The data we presented in Table 4.3 concerning lumber have already been
cleaned up somewhat. Similar data sets from the same study also contained data on
$x_5$ = tensile strength. Nine observation vectors, out of the total of 112, are given as
rows in the following table, along with their standardized values.

$x_1$   $x_2$   $x_3$   $x_4$   $x_5$     $z_1$    $z_2$    $z_3$    $z_4$    $z_5$

1631    1528    1452    1559    1602       .06     -.15      .05      .28     -.12
1770    1677    1707    1738    1785       .64      .43     1.07      .94      .60
1376    1190     723    1285    2791     -1.01    -1.47    -2.87     -.73
1705    1577    1332    1703    1664       .37      .04     -.43      .81      .13
1643    1535    1510    1494    1582       .11     -.12      .28      .04     -.20
1567    1510    1301    1405    1553      -.21     -.22     -.56     -.28     -.31
1528    1591    1714    1685    1698      -.38      .10     1.10      .75      .26
1803    1826    1748    2746    1764       .78     1.01     1.23               .52
1587    1554    1352    1554    1551      -.13     -.05     -.35      .26     -.32




The standardized values are based on the sample mean and variance, calculated 
from all 112 observations. There are two extreme standardized values. Both are too large 
with standardized values over 4.5. During their investigation, the researchers recorded 
measurements by hand in a logbook and then performed calculations that produced the 
values given in the table. When they checked their records regarding the values
pinpointed by this analysis, errors were discovered. The value $x_5$ = 2791 was corrected to
1241, and $x_4$ = 2746 was corrected to 1670. Incorrect readings on an individual variable
are quickly detected by locating a large leading digit for the standardized value. 

The next example returns to the data on lumber discussed in Example 4.14. 

Example 4.15 (Detecting outliers in the data on lumber) Table 4.4 contains the data
in Table 4.3, along with the standardized observations. These data consist of four
different measures of stiffness $x_1, x_2, x_3$, and $x_4$, on each of n = 30 boards. Recall
that the first measurement involves sending a shock wave down the board, the second
measurement is determined while vibrating the board, and the last two measurements
are obtained from static tests. The standardized measurements are

$$z_{jk} = \frac{x_{jk} - \bar{x}_k}{\sqrt{s_{kk}}}, \qquad k = 1, 2, 3, 4; \quad j = 1, 2, \ldots, 30$$

and the squares of the distances are
$d_j^2 = (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}})$.

Table 4.4 Four Measurements of Stiffness with Standardized Values

Observation no.   $x_1$   $x_2$   $x_3$   $x_4$    $z_1$   $z_2$   $z_3$   $z_4$    $d^2$
 1                1889    1651    1561    1778     -.1     -.3      .2      .2       .60
 2                2403    2048    2087    2197     1.5      .9     1.9     1.5      5.48
 3                2119    1700    1815    2222      .7     -.2     1.0     1.5      7.62
 4                1645    1627    1110    1533     -.8     -.4    -1.3     -.6      5.21
 5                1976    1916    1614    1883      .2      .5      .3      .5      1.40
 6                1712    1712    1439    1546     -.6     -.1     -.2     -.6      2.22
 7                1943    1685    1271    1671      .1     -.2     -.8     -.2      4.99
 8                2104    1820    1717    1874      .6      .2      .7      .5      1.49
 9                2983    2794    2412    2581     3.3     3.3     3.0     2.7     12.26
10                1745    1600    1384    1508     -.5     -.5     -.4     -.7       .77
11                1710    1591    1518    1667     -.6     -.5      .0     -.2      1.93
12                2046    1907    1627    1898      .4      .5      .4      .5       .46
13                1840    1841    1595    1741     -.2      .3      .3      .0      2.70
14                1867    1685    1493    1678     -.1     -.2     -.1     -.1       .13
15                1859    1649    1389    1714     -.1     -.3     -.4     -.0      1.08
16                1954    2149    1180    1281                                     16.85
17                1325    1170    1002    1176                                      3.50
18                1419    1371    1252    1308                                      3.99
19                1828    1634    1602    1755                                      1.36
20                1725    1594    1313    1646                                      1.46
21                2276    2189    1547    2111                                      9.90
22                1899    1614    1422    1477                                      5.06
23                1633    1513    1290    1516
24                2061    1867    1646    2037                                      2.54
25                1856    1493    1356    1533                                      4.58
26                1727    1412    1238    1469                                      3.40
27                2168    1896    1701    1834                                      2.38
28                1655    1675    1414    1597                                      3.00
29                2326    2301    2065    2234                                      6.28
30                1490    1382    1214    1284                                      2.58

The last column in Table 4.4 reveals that specimen 16 is a multivariate outlier,
since $\chi_4^2(.005) = 14.86$; yet all of the individual measurements are well within their
respective univariate scatters. Specimen 9 also has a large $d^2$ value.

The two specimens (9 and 16) with large squared distances stand out as clearly
different from the rest of the pattern in Figure 4.9. Once these two points are
removed, the remaining pattern conforms to the expected straight-line relation.
Scatter plots for the lumber stiffness measurements are given in Figure 4.11.

Figure 4.11 Scatter plots for the lumber stiffness data with specimens 9 and 16 plotted as solid dots.


The solid dots in these figures correspond to specimens 9 and 16. Although the dot for
specimen 16 stands out in all the plots, the dot for specimen 9 is "hidden" in the
scatter plot of $x_3$ versus $x_4$ and nearly hidden in that of $x_1$ versus $x_3$. However,
specimen 9 is clearly identified as a multivariate outlier when all four variables are
considered.

Scientists specializing in the properties of wood conjectured that specimen 9
was unusually clear and therefore very stiff and strong. It would also appear that
specimen 16 is a bit unusual, since both of its dynamic measurements are above
average and the two static measurements are low. Unfortunately, it was not possible to
investigate this specimen further because the material was no longer available. ■
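The squared generalized distances for the stiffness data can be recomputed with a self-contained sketch. The rows are the four measurements of Table 4.3 as transcribed here; the helper names and the small Gaussian-elimination solver are assumptions made to avoid external libraries, and the output should reproduce the $d^2$ column only to the extent that the transcription is accurate.

```python
# Four stiffness measurements (x1, x2, x3, x4) for boards 1..30, from Table 4.3
stiffness = [
    (1889, 1651, 1561, 1778), (2403, 2048, 2087, 2197), (2119, 1700, 1815, 2222),
    (1645, 1627, 1110, 1533), (1976, 1916, 1614, 1883), (1712, 1712, 1439, 1546),
    (1943, 1685, 1271, 1671), (2104, 1820, 1717, 1874), (2983, 2794, 2412, 2581),
    (1745, 1600, 1384, 1508), (1710, 1591, 1518, 1667), (2046, 1907, 1627, 1898),
    (1840, 1841, 1595, 1741), (1867, 1685, 1493, 1678), (1859, 1649, 1389, 1714),
    (1954, 2149, 1180, 1281), (1325, 1170, 1002, 1176), (1419, 1371, 1252, 1308),
    (1828, 1634, 1602, 1755), (1725, 1594, 1313, 1646), (2276, 2189, 1547, 2111),
    (1899, 1614, 1422, 1477), (1633, 1513, 1290, 1516), (2061, 1867, 1646, 2037),
    (1856, 1493, 1356, 1533), (1727, 1412, 1238, 1469), (2168, 1896, 1701, 1834),
    (1655, 1675, 1414, 1597), (2326, 2301, 2065, 2234), (1490, 1382, 1214, 1284),
]

def mean_vector(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]

def covariance(rows, xbar):
    """Sample covariance matrix S with divisor n - 1."""
    n, p = len(rows), len(xbar)
    return [[sum((r[i] - xbar[i]) * (r[k] - xbar[k]) for r in rows) / (n - 1)
             for k in range(p)] for i in range(p)]

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]       # augmented matrix
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, p):
            f = m[r][col] / m[col][col]
            for c in range(col, p + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * p
    for i in reversed(range(p)):                            # back substitution
        z[i] = (m[i][p] - sum(m[i][c] * z[c] for c in range(i + 1, p))) / m[i][i]
    return z

def squared_distances(rows):
    """d_j^2 = (x_j - xbar)' S^{-1} (x_j - xbar) for every row, as in (4-32)."""
    xbar = mean_vector(rows)
    s = covariance(rows, xbar)
    out = []
    for r in rows:
        dev = [r[k] - xbar[k] for k in range(len(xbar))]
        out.append(sum(d * v for d, v in zip(dev, solve(s, dev))))
    return out

d2 = squared_distances(stiffness)
largest = sorted(range(len(d2)), key=lambda j: d2[j], reverse=True)[:2]
print([(j + 1, round(d2[j], 2)) for j in largest])
```

A useful sanity check on any such computation is that the squared distances always sum to (n − 1)p, here 29 × 4 = 116, whatever the data.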


If outliers are identified, they should be examined for content, as was done in 
the case of the data on lumber stiffness in Example 4.15. Depending upon the 
nature of the outliers and the objectives of the investigation, outliers may be delet- 
ed or appropriately “weighted” in a subsequent analysis. 

Even though many statistical techniques assume normal populations, those 
based on the sample mean vectors usually will not be disturbed by a few moderate 
outliers. Hawkins [7] gives an extensive treatment of the subject of outliers. 


4.8 Transformations to Near Normality 


If normality is not a viable assumption, what is the next step? One alternative is to 
ignore the findings of a normality check and proceed as if the data were normally 
distributed. This practice is not recommended, since, in many instances, it could lead 
to incorrect conclusions. A second alternative is to make nonnormal data more 
“normal looking” by considering transformations of the data. Normal-theory analy- 
ses can then be carried out with the suitably transformed data. 

Transformations are nothing more than a reexpression of the data in different 
units. For example, when a histogram of positive observations exhibits a long right- 
hand tail, transforming the observations by taking their logarithms or square roots 
will often markedly improve the symmetry about the mean and the approximation 
to a normal distribution. It frequently happens that the new units provide more 
natural expressions of the characteristics being studied. 

Appropriate transformations are suggested by (1) theoretical considerations or 
(2) the data themselves (or both). It has been shown theoretically that data that are 
counts can often be made more normal by taking their square roots. Similarly, the 
logit transformation applied to proportions and Fisher’s z-transformation applied to 
correlation coefficients yield quantities that are approximately normally distributed. 


Helpful Transformations To Near Normality

Original Scale              Transformed Scale

1. Counts, y                $\sqrt{y}$

2. Proportions, $\hat{p}$   $\text{logit}(\hat{p}) = \frac{1}{2}\log\left(\frac{\hat{p}}{1 - \hat{p}}\right)$   (4-33)

3. Correlations, r          Fisher's $z(r) = \frac{1}{2}\log\left(\frac{1 + r}{1 - r}\right)$
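The three transformations in (4-33) can be written as small Python functions. This is a sketch under assumptions: the function names are hypothetical, and the logit is coded with the factor ½ as (4-33) is read here, whereas many references define the logit without that factor.

```python
import math

def sqrt_count(y):
    """Square-root transformation for a count y >= 0."""
    return math.sqrt(y)

def half_logit(p):
    """(1/2) log(p / (1 - p)) for a proportion 0 < p < 1, as read from (4-33)."""
    return 0.5 * math.log(p / (1.0 - p))

def fisher_z(r):
    """Fisher's z-transformation (1/2) log((1 + r) / (1 - r)) for -1 < r < 1."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

print(sqrt_count(9), half_logit(0.5), round(fisher_z(0.5), 4))
```

Fisher's z is the inverse hyperbolic tangent, so `fisher_z(r)` should agree with `math.atanh(r)`.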




In many instances, the choice of a transformation to improve the approximation 
to normality is not obvious. For such cases, it is convenient to let the data suggest a 
transformation. A useful family of transformations for this purpose is the family of 
power transformations. 

Power transformations are defined only for positive variables. However, this is 
not as restrictive as it seems, because a single constant can be added to each obser- 
vation in the data set if some of the values are negative. 

Let x represent an arbitrary observation. The power family of transformations
is indexed by a parameter $\lambda$. A given value for $\lambda$ implies a particular
transformation. For example, consider $x^{\lambda}$ with $\lambda = -1$. Since
$x^{-1} = 1/x$, this choice of $\lambda$ corresponds to the reciprocal transformation. We
can trace the family of transformations as $\lambda$ ranges from negative to positive
powers of x. For $\lambda = 0$, we define $x^0 = \ln x$. A sequence of possible
transformations is

$$\underbrace{\ldots,\ x^{-1} = \frac{1}{x},\ x^{0} = \ln x,\ x^{1/4} = \sqrt[4]{x},\ x^{1/2} = \sqrt{x}}_{\text{shrinks large values of } x},\qquad \underbrace{x^{2},\ x^{3},\ \ldots}_{\text{increases large values of } x}$$

To select a power transformation, an investigator looks at the marginal dot
diagram or histogram and decides whether large values have to be "pulled in" or
"pushed out" to improve the symmetry about the mean. Trial-and-error calculations
with a few of the foregoing transformations should produce an improvement. The
final choice should always be examined by a Q-Q plot or other checks to see
whether the tentative normal assumption is satisfactory.

The transformations we have been discussing are data based in the sense that it 
is only the appearance of the data themselves that influences the choice of an appro- 
priate transformation. There are no external considerations involved, although the 
transformation actually used is often determined by some mix of information sup- 
plied by the data and extra-data factors, such as simplicity or ease of interpretation. 

A convenient analytical method is available for choosing a power transforma- 
tion. We begin by focusing our attention on the univariate case. 

Box and Cox [3] consider the slightly modified family of power transformations

$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \lambda \ne 0 \\[2ex] \ln x & \lambda = 0 \end{cases} \qquad (4\text{-}34)$$

which is continuous in $\lambda$ for x > 0. (See [8].) Given the observations
$x_1, x_2, \ldots, x_n$, the Box-Cox solution for the choice of an appropriate power
$\lambda$ is the solution that maximizes the expression

$$\ell(\lambda) = -\frac{n}{2} \ln\left[\frac{1}{n}\sum_{j=1}^{n} \left(x_j^{(\lambda)} - \overline{x^{(\lambda)}}\right)^2\right] + (\lambda - 1)\sum_{j=1}^{n} \ln x_j \qquad (4\text{-}35)$$

We note that $x_j^{(\lambda)}$ is defined in (4-34) and

$$\overline{x^{(\lambda)}} = \frac{1}{n}\sum_{j=1}^{n} x_j^{(\lambda)} = \frac{1}{n}\sum_{j=1}^{n}\left(\frac{x_j^{\lambda} - 1}{\lambda}\right) \qquad (4\text{-}36)$$

is the arithmetic average of the transformed observations. The first term in (4-35) is,
apart from a constant, the logarithm of a normal likelihood function, after maximizing
it with respect to the population mean and variance parameters.

The calculation of $\ell(\lambda)$ for many values of $\lambda$ is an easy task for a
computer. It is helpful to have a graph of $\ell(\lambda)$ versus $\lambda$, as well as a
tabular display of the pairs $(\lambda, \ell(\lambda))$, in order to study the behavior near
the maximizing value $\hat{\lambda}$. For instance, if either $\lambda = 0$ (logarithm) or
$\lambda = \frac{1}{2}$ (square root) is near $\hat{\lambda}$, one of these may be
preferred because of its simplicity.
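A grid evaluation of $\ell(\lambda)$ can be sketched as follows. The function names and the data are hypothetical: the five values are constructed so that their logarithms are symmetric about their mean, in which case the grid maximum falls at $\lambda = 0$ (the logarithm).

```python
import math

def box_cox(x, lam):
    """Box-Cox transform (4-34) of a single positive value x."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def log_likelihood(xs, lam):
    """ell(lambda) of (4-35), i.e. the log-likelihood apart from a constant."""
    n = len(xs)
    t = [box_cox(x, lam) for x in xs]
    mean_t = sum(t) / n                                   # (4-36)
    var_t = sum((v - mean_t) ** 2 for v in t) / n         # divisor n, as in (4-35)
    return -0.5 * n * math.log(var_t) + (lam - 1.0) * sum(math.log(x) for x in xs)

# Hypothetical positive data whose logarithms are symmetric about ln 3
xs = [3.0 * math.exp(v) for v in (-1.0, -0.5, 0.0, 0.5, 1.0)]

grid = [i / 10 for i in range(-10, 16)]                   # lambda = -1.0, ..., 1.5
table = [(lam, log_likelihood(xs, lam)) for lam in grid]
best_lam = max(grid, key=lambda lam: log_likelihood(xs, lam))

for lam, ell in table:
    print(f"{lam:5.2f}  {ell:8.3f}")
print("maximizing lambda on the grid:", best_lam)
```

In practice the tabulated pairs would be inspected near the maximum, and a nearby simple value such as 0 or ½ might be chosen instead of the exact maximizer.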

Rather than program the calculation of (4-35), some statisticians recommend
the equivalent procedure of fixing $\lambda$, creating the new variable

$$y_j^{(\lambda)} = \frac{x_j^{\lambda} - 1}{\lambda\left[\left(\prod_{i=1}^{n} x_i\right)^{1/n}\right]^{\lambda - 1}}, \qquad j = 1, \ldots, n \qquad (4\text{-}37)$$

and then calculating the sample variance. The minimum of the variance occurs at the
same $\lambda$ that maximizes (4-35).
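The variance-based procedure of (4-37) can be sketched in the same spirit. The names and data are hypothetical, and the $\lambda = 0$ branch uses the limiting form (geometric mean times ln x), which is an assumption added here because (4-37) is written for $\lambda \ne 0$.

```python
import math

def scaled_power(xs, lam):
    """The variable y_j^(lambda) of (4-37), scaled by the geometric mean."""
    n = len(xs)
    gm = math.exp(sum(math.log(x) for x in xs) / n)       # geometric mean
    if lam == 0:
        return [gm * math.log(x) for x in xs]             # limit as lambda -> 0
    return [(x ** lam - 1.0) / (lam * gm ** (lam - 1.0)) for x in xs]

def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Hypothetical positive data whose logarithms are symmetric about ln 3
xs = [3.0 * math.exp(v) for v in (-1.0, -0.5, 0.0, 0.5, 1.0)]

grid = [i / 10 for i in range(-10, 16)]                   # lambda = -1.0, ..., 1.5
best_lam = min(grid, key=lambda lam: sample_variance(scaled_power(xs, lam)))
print("variance-minimizing lambda on the grid:", best_lam)
```

Because the logarithms of these data are symmetric, the smallest variance on the grid occurs at $\lambda = 0$.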

Comment. It is now understood that the transformation obtained by maximizing
$\ell(\lambda)$ usually improves the approximation to normality. However, there is no
guarantee that even the best choice of $\lambda$ will produce a transformed set of values
that adequately conform to a normal distribution. The outcomes produced by a
transformation selected according to (4-35) should always be carefully examined for
possible violations of the tentative assumption of normality. This warning applies
with equal force to transformations selected by any other technique.


Example 4.16 (Determining a power transformation for univariate data) We gave
readings of the microwave radiation emitted through the closed doors of n = 42
ovens in Example 4.10. The Q-Q plot of these data in Figure 4.6 indicates that the
observations deviate from what would be expected if they were normally distrib-
uted. Since all the observations are positive, let us perform a power transformation
of the data which, we hope, will produce results that are more nearly normal.
Restricting our attention to the family of transformations in (4-34), we must find
that value of $\lambda$ maximizing the function $\ell(\lambda)$ in (4-35).
The pairs $(\lambda, \ell(\lambda))$ are listed in the following table for several values of $\lambda$:

  $\lambda$    $\ell(\lambda)$      $\lambda$    $\ell(\lambda)$
  -1.00     70.52       .40    106.20
   -.90     75.65       .50    105.50
   -.80     80.46       .60    104.43
   -.70     84.94       .70    103.03
   -.60     89.06       .80    101.33
   -.50     92.79       .90     99.34
   -.40     96.10      1.00     97.10
   -.30     98.97      1.10     94.64
   -.20    101.39      1.20     91.96
   -.10    103.35      1.30     89.10
    .00    104.83      1.40     86.07
    .10    105.84      1.50     82.88
    .20    106.39
    .30    106.51


Figure 4.12 Plot of $\ell(\lambda)$ versus $\lambda$ for radiation data (door closed). The maximum is marked at $\hat{\lambda} = 0.28$.


The curve of $\ell(\lambda)$ versus $\lambda$ that allows the more exact determination $\hat{\lambda} = .28$ is
shown in Figure 4.12.

It is evident from both the table and the plot that a value of $\hat{\lambda}$ around .30
maximizes $\ell(\lambda)$. For convenience, we choose $\hat{\lambda} = .25$. The data $x_j$ were
reexpressed as

\[
x_j^{(1/4)} = \frac{x_j^{1/4} - 1}{\tfrac{1}{4}}, \qquad j = 1, 2, \ldots, 42
\]

and a Q-Q plot was constructed from the transformed quantities. This plot is shown
in Figure 4.13. The quantile pairs fall very close to a straight line, and we
would conclude from this evidence that the $x_j^{(1/4)}$ are approximately normal. ■


Transforming Multivariate Observations 


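With multivariate observations, a power transformation must be selected for each of
the variables. Let $\lambda_1, \lambda_2, \ldots, \lambda_p$ be the power transformations for the p measured
characteristics. Each $\lambda_k$ can be selected by maximizing

\[
\ell_k(\lambda) = -\frac{n}{2}\,\ln\!\left[\frac{1}{n}\sum_{j=1}^{n}\left(x_{jk}^{(\lambda_k)} - \overline{x_k^{(\lambda_k)}}\right)^{2}\right] + (\lambda_k - 1)\sum_{j=1}^{n}\ln x_{jk} \tag{4-38}
\]

where $x_{1k}, x_{2k}, \ldots, x_{nk}$ are the n observations on the kth variable, $k = 1, 2, \ldots, p$.
Here

\[
\overline{x_k^{(\lambda_k)}} = \frac{1}{n}\sum_{j=1}^{n} x_{jk}^{(\lambda_k)} = \frac{1}{n}\sum_{j=1}^{n}\left(\frac{x_{jk}^{\lambda_k} - 1}{\lambda_k}\right) \tag{4-39}
\]

is the arithmetic average of the transformed observations. The jth transformed mul-
tivariate observation is

\[
\mathbf{x}_j^{(\hat{\boldsymbol{\lambda}})} =
\begin{bmatrix}
\dfrac{x_{j1}^{\hat{\lambda}_1} - 1}{\hat{\lambda}_1} \\[8pt]
\dfrac{x_{j2}^{\hat{\lambda}_2} - 1}{\hat{\lambda}_2} \\[4pt]
\vdots \\[4pt]
\dfrac{x_{jp}^{\hat{\lambda}_p} - 1}{\hat{\lambda}_p}
\end{bmatrix}
\]

where $\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_p$ are the values that individually maximize (4-38).

Figure 4.13 A Q-Q plot of the transformed radiation data (door closed).
(The integers in the plot indicate the number of points occupying the same
location.)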


The procedure just described is equivalent to making each marginal distribution
approximately normal. Although normal marginals are not sufficient to ensure that
the joint distribution is normal, in practical applications this may be good enough.
If not, we could start with the values $\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_p$ obtained from the preceding
transformations and iterate toward the set of values $\boldsymbol{\lambda}' = [\lambda_1, \lambda_2, \ldots, \lambda_p]$, which col-
lectively maximizes

\[
\ell(\lambda_1, \lambda_2, \ldots, \lambda_p) = -\frac{n}{2}\,\ln|\mathbf{S}(\boldsymbol{\lambda})| + (\lambda_1 - 1)\sum_{j=1}^{n}\ln x_{j1} + (\lambda_2 - 1)\sum_{j=1}^{n}\ln x_{j2} + \cdots + (\lambda_p - 1)\sum_{j=1}^{n}\ln x_{jp} \tag{4-40}
\]

where $\mathbf{S}(\boldsymbol{\lambda})$ is the sample covariance matrix computed from

\[
\mathbf{x}_j^{(\boldsymbol{\lambda})} =
\left[\frac{x_{j1}^{\lambda_1} - 1}{\lambda_1},\; \frac{x_{j2}^{\lambda_2} - 1}{\lambda_2},\; \ldots,\; \frac{x_{jp}^{\lambda_p} - 1}{\lambda_p}\right]', \qquad j = 1, 2, \ldots, n
\]

Maximizing (4-40) not only is substantially more difficult than maximizing the indi-
vidual expressions in (4-38), but also is unlikely to yield remarkably better results. The
selection method based on Equation (4-40) is equivalent to maximizing a multivariate
likelihood over $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ and $\boldsymbol{\lambda}$, whereas the method based on (4-38) corresponds to maxi-
mizing the kth univariate likelihood over $\mu_k$, $\sigma_{kk}$, and $\lambda_k$. The latter likelihood is
generated by pretending there is some $\lambda_k$ for which the observations $(x_{jk}^{\lambda_k} - 1)/\lambda_k$,
$j = 1, 2, \ldots, n$, have a normal distribution. See [3] and [2] for detailed discussions of the
univariate and multivariate cases, respectively. (Also, see [8].)
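For two variables, the joint criterion (4-40) can be evaluated on a grid much as in the univariate case. The following sketch is illustrative only: the function names and the data in `pairs` are hypothetical, the determinant is written out for the 2 x 2 case, and a plain grid search stands in for whatever iterative scheme one might actually use.

```python
import math


def box_cox(x, lam):
    """Box-Cox power transformation (4-34) of a single positive value."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam


def joint_log_likelihood(data, lams):
    """ell(lambda_1, lambda_2) in (4-40) for rows (x1, x2) and lams = (lam1, lam2)."""
    n = len(data)
    cols = [[box_cox(row[k], lams[k]) for row in data] for k in range(2)]
    means = [sum(col) / n for col in cols]

    def cov(a, b):
        # Entry (a, b) of the sample covariance matrix S(lambda), divisor n - 1.
        return sum((cols[a][j] - means[a]) * (cols[b][j] - means[b]) for j in range(n)) / (n - 1)

    det = cov(0, 0) * cov(1, 1) - cov(0, 1) ** 2
    jacobian = sum((lams[k] - 1.0) * sum(math.log(row[k]) for row in data) for k in range(2))
    return -0.5 * n * math.log(det) + jacobian


# Hypothetical positive bivariate sample, used only for illustration.
pairs = [(0.4, 1.2), (0.7, 0.9), (1.1, 2.5), (1.3, 1.1), (2.0, 3.8),
         (2.6, 2.2), (3.9, 6.0), (5.5, 4.1), (8.1, 9.7), (14.2, 7.5)]

grid = [round(0.1 * i, 1) for i in range(11)]          # 0.0, 0.1, ..., 1.0
candidates = [(l1, l2) for l1 in grid for l2 in grid]
best = max(candidates, key=lambda lams: joint_log_likelihood(pairs, lams))
print("grid maximizer (lambda_1, lambda_2):", best)
```

The values of `joint_log_likelihood` over `candidates` are what one would pass to a contouring routine to draw a plot of the kind described for the joint criterion.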


Example 4.17 (Determining power transformations for bivariate data) Radiation
measurements were also recorded through the open doors of the n = 42
microwave ovens introduced in Example 4.10. The amount of radiation emitted
through the open doors of these ovens is listed in Table 4.5.

In accordance with the procedure outlined in Example 4.16, a power transfor-
mation for these data was selected by maximizing $\ell(\lambda)$ in (4-35). The approximate
maximizing value was $\hat{\lambda} = .30$. Figure 4.14 shows Q-Q plots of the un-
transformed and transformed door-open radiation data. (These data were actually
transformed by taking the fourth root, as in Example 4.16.) It is clear from the figure
that the transformed data are more nearly normal, although the normal approxima-
tion is not as good as it was for the door-closed data.

Table 4.5 Radiation Data (Door Open)

Oven no.  Radiation    Oven no.  Radiation    Oven no.  Radiation
    1        .30          16        .20          31        .10
    2        .09          17        .04          32        .10
    3        .30          18        .10          33        .10
    4        .10          19        .01          34        .30
    5        .10          20        .60          35        .12
    6        .12          21        .12          36        .25
    7        .09          22        .10          37        .20
    8        .10          23        .05          38        .40
    9        .09          24        .05          39        .33
   10        .10          25        .15          40        .32
   11        .07          26        .30          41        .12
   12        .05          27        .15          42        .12
   13        .01          28        .09
   14        .45          29        .09
   15        .12          30        .28

Source: Data courtesy of J. D. Cryer.

Let us denote the door-closed data by $x_{11}, x_{21}, \ldots, x_{42,1}$ and the door-open data
by $x_{12}, x_{22}, \ldots, x_{42,2}$. Choosing a power transformation for each set by maximizing
the expression in (4-35) is equivalent to maximizing $\ell_k(\lambda)$ in (4-38) with $k = 1, 2$.
Thus, using the outcomes from Example 4.16 and the foregoing results, we have
$\hat{\lambda}_1 = .30$ and $\hat{\lambda}_2 = .30$. These powers were determined for the marginal distribu-
tions of $x_1$ and $x_2$.

We can consider the joint distribution of $x_1$ and $x_2$ and simultaneously deter-
mine the pair of powers $(\lambda_1, \lambda_2)$ that makes this joint distribution approximately
bivariate normal. To do this, we must maximize $\ell(\lambda_1, \lambda_2)$ in (4-40) with respect to
both $\lambda_1$ and $\lambda_2$.

We computed $\ell(\lambda_1, \lambda_2)$ for a grid of $\lambda_1, \lambda_2$ values covering $0 \le \lambda_1 \le .50$ and
$0 \le \lambda_2 \le .50$, and we constructed the contour plot shown in Figure 4.15.
We see that the maximum occurs at about $(\hat{\lambda}_1, \hat{\lambda}_2) = (.16, .16)$.

The “best” power transformations for this bivariate case do not differ substan-
tially from those obtained by considering each marginal distribution. ■


As we saw in Example 4.17, making each marginal distribution approximately 
normal is roughly equivalent to addressing the bivariate distribution directly and 
making it approximately normal. It is generally easier to select appropriate transfor- 
mations for the marginal distributions than for the joint distributions. 


Figure 4.14 Q-Q plots of (a) the original and (b) the transformed
radiation data (with door open). (The integers in the plot indicate the
number of points occupying the same location.)


Figure 4.15 Contour plot of $\ell(\lambda_1, \lambda_2)$ for the radiation data.


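If the data include some large negative values and have a single long tail, a
more general transformation (see Yeo and Johnson [14]) should be applied.

\[
x^{(\lambda)} =
\begin{cases}
\{(x + 1)^{\lambda} - 1\}/\lambda & x \ge 0,\ \lambda \neq 0 \\
\ln(x + 1) & x \ge 0,\ \lambda = 0 \\
-\{(-x + 1)^{2 - \lambda} - 1\}/(2 - \lambda) & x < 0,\ \lambda \neq 2 \\
-\ln(-x + 1) & x < 0,\ \lambda = 2
\end{cases}
\]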
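The four cases of the Yeo and Johnson family translate directly into code. The sketch below is a minimal illustration; the function name `yeo_johnson` and the test values are assumptions made here, not part of the text.

```python
import math


def yeo_johnson(x, lam):
    """Yeo-Johnson transformation of a single value x, following the four cases above."""
    if x >= 0:
        if lam != 0:
            return ((x + 1.0) ** lam - 1.0) / lam
        return math.log(x + 1.0)
    if lam != 2:
        return -(((-x + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    return -math.log(-x + 1.0)


# Hypothetical values, including negatives, used only for illustration.
values = [-3.0, -0.5, 0.0, 0.5, 3.0]
for lam in (0.0, 0.5, 1.0, 2.0):
    print(lam, [round(yeo_johnson(v, lam), 4) for v in values])
```

With $\lambda = 1$ the transformation leaves every value unchanged, which is a convenient sanity check on an implementation.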


Exercises

4.1. Consider a bivariate normal distribution with $\mu_1 = 1$, $\mu_2 = 3$, $\sigma_{11} = 2$, $\sigma_{22} = 1$ and
$\rho_{12} = -.8$.
(a) Write out the bivariate normal density.
(b) Write out the squared statistical distance expression $(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})$ as a qua-
dratic function of $x_1$ and $x_2$.

4.2. Consider a bivariate normal population with $\mu_1 = 0$, $\mu_2 = 2$, $\sigma_{11} = 2$, $\sigma_{22} = 1$, and
$\rho_{12} = .5$.
(a) Write out the bivariate normal density.
(b) Write out the squared generalized distance expression $(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})$ as a
function of $x_1$ and $x_2$.
(c) Determine (and sketch) the constant-density contour that contains 50% of the
probability.

4.3. Let $\mathbf{X}$ be $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $\boldsymbol{\mu}' = [-3, 1, 4]$ and

\[
\boldsymbol{\Sigma} = \begin{bmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}
\]

Which of the following random variables are independent? Explain.
(a) $X_1$ and $X_2$
(b) $X_2$ and $X_3$
(c) $(X_1, X_2)$ and $X_3$
(d) $\dfrac{X_1 + X_2}{2}$ and $X_3$
(e) $X_2$ and $X_2 - 2X_1 - X_3$

4.4. Let $\mathbf{X}$ be $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $\boldsymbol{\mu}' = [2, -3, 1]$ and

\[
\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 3 & 2 \\ 1 & 2 & 2 \end{bmatrix}
\]

(a) Find the distribution of $3X_1 - 2X_2 + X_3$.
(b) Relabel the variables if necessary, and find a $2 \times 1$ vector $\mathbf{a}$ such that $X_2$ and
$X_2 - \mathbf{a}'\begin{bmatrix} X_1 \\ X_3 \end{bmatrix}$ are independent.

4.5. Specify each of the following.
(a) The conditional distribution of $X_1$, given that $X_2 = x_2$ for the joint distribution in
Exercise 4.2.
(b) The conditional distribution of $X_2$, given that $X_1 = x_1$ and $X_3 = x_3$ for the joint dis-
tribution in Exercise 4.3.
(c) The conditional distribution of $X_3$, given that $X_1 = x_1$ and $X_2 = x_2$ for the joint dis-
tribution in Exercise 4.4.

4.6. Let $\mathbf{X}$ be distributed as $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\mu}' = [1, -1, 2]$ and

\[
\boldsymbol{\Sigma} = \begin{bmatrix} 4 & 0 & -1 \\ 0 & 5 & 0 \\ -1 & 0 & 2 \end{bmatrix}
\]

Which of the following random variables are independent? Explain.
(a) $X_1$ and $X_2$
(b) $X_1$ and $X_3$
(c) $X_2$ and $X_3$
(d) $(X_1, X_3)$ and $X_2$
(e) $X_1$ and $X_1 + 3X_2 - 2X_3$

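4.7. Refer to Exercise 4.6 and specify each of the following.
(a) The conditional distribution of $X_1$, given that $X_3 = x_3$.
(b) The conditional distribution of $X_1$, given that $X_2 = x_2$ and $X_3 = x_3$.

4.8. (Example of a nonnormal bivariate distribution with normal marginals.) Let $X_1$ be
$N(0, 1)$, and let

\[
X_2 = \begin{cases} -X_1 & \text{if } -1 \le X_1 \le 1 \\ X_1 & \text{otherwise} \end{cases}
\]

Show each of the following.
(a) $X_2$ also has an $N(0, 1)$ distribution.
(b) $X_1$ and $X_2$ do not have a bivariate normal distribution.
Hint:
(a) Since $X_1$ is $N(0, 1)$, $P[-1 < X_1 \le x] = P[-x \le X_1 < 1]$ for any $x$. When
$-1 < x_2 < 1$, $P[X_2 \le x_2] = P[X_2 \le -1] + P[-1 < X_2 \le x_2] = P[X_1 \le -1]
+ P[-1 < -X_1 \le x_2] = P[X_1 \le -1] + P[-x_2 \le X_1 < 1]$. But $P[-x_2 \le X_1 < 1]
= P[-1 < X_1 \le x_2]$ from the symmetry argument in the first line of this hint.
Thus, $P[X_2 \le x_2] = P[X_1 \le -1] + P[-1 < X_1 \le x_2] = P[X_1 \le x_2]$, which is
a standard normal probability.
(b) Consider the linear combination $X_1 - X_2$, which equals zero with probability
$P[|X_1| > 1] = .3174$.

4.9. Refer to Exercise 4.8, but modify the construction by replacing the break point 1 by
c so that

\[
X_2 = \begin{cases} -X_1 & \text{if } -c \le X_1 \le c \\ X_1 & \text{elsewhere} \end{cases}
\]

Show that c can be chosen so that $\mathrm{Cov}(X_1, X_2) = 0$, but that the two random variables
are not independent.
Hint:
For $c = 0$, evaluate $\mathrm{Cov}(X_1, X_2) = E[X_1(X_1)]$.
For c very large, evaluate $\mathrm{Cov}(X_1, X_2) \doteq E[X_1(-X_1)]$.

4.10. Show each of the following.
(a) $\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = |\mathbf{A}||\mathbf{B}|$
(b) $\begin{vmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = |\mathbf{A}||\mathbf{B}|$ for $|\mathbf{A}| \neq 0$
Hint:
(a) $\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = \begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}\begin{vmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix}$. Expanding the determinant $\begin{vmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix}$ by the first row
(see Definition 2A.24) gives 1 times a determinant of the same form, with the order
of $\mathbf{I}$ reduced by one. This procedure is repeated until $1 \times |\mathbf{B}|$ is obtained. Similarly,
expanding the determinant $\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}$ by the last row gives $\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix} = |\mathbf{A}|$.
(b) $\begin{vmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = \begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix}\begin{vmatrix} \mathbf{I} & \mathbf{A}^{-1}\mathbf{C} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}$. But expanding the determinant $\begin{vmatrix} \mathbf{I} & \mathbf{A}^{-1}\mathbf{C} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}$
by the last row gives $\begin{vmatrix} \mathbf{I} & \mathbf{A}^{-1}\mathbf{C} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix} = 1$. Now use the result in Part a.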
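4.11. Show that, if $\mathbf{A}$ is square,

\[
|\mathbf{A}| = |\mathbf{A}_{22}||\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}| \quad \text{for } |\mathbf{A}_{22}| \neq 0
\]
\[
\phantom{|\mathbf{A}|} = |\mathbf{A}_{11}||\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}| \quad \text{for } |\mathbf{A}_{11}| \neq 0
\]

Hint: Partition $\mathbf{A}$ and verify that

\[
\begin{bmatrix} \mathbf{I} & -\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}
\begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}
\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{I} \end{bmatrix}
=
\begin{bmatrix} \mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{0} \\ \mathbf{0}' & \mathbf{A}_{22} \end{bmatrix}
\]

Take determinants on both sides of this equality. Use Exercise 4.10 for the first and
third determinants on the left and for the determinant on the right. The second equality
for $|\mathbf{A}|$ follows by considering

\[
\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{21}\mathbf{A}_{11}^{-1} & \mathbf{I} \end{bmatrix}
\begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}
\begin{bmatrix} \mathbf{I} & -\mathbf{A}_{11}^{-1}\mathbf{A}_{12} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}
=
\begin{bmatrix} \mathbf{A}_{11} & \mathbf{0} \\ \mathbf{0}' & \mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12} \end{bmatrix}
\]

4.12. Show that, for $\mathbf{A}$ symmetric,

\[
\mathbf{A}^{-1} =
\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{I} \end{bmatrix}
\begin{bmatrix} (\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})^{-1} & \mathbf{0} \\ \mathbf{0}' & \mathbf{A}_{22}^{-1} \end{bmatrix}
\begin{bmatrix} \mathbf{I} & -\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}
\]

Thus, $(\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})^{-1}$ is the upper left-hand block of $\mathbf{A}^{-1}$.

Hint: Premultiply the expression in the hint to Exercise 4.11 by $\begin{bmatrix} \mathbf{I} & -\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}^{-1}$ and
postmultiply by $\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{I} \end{bmatrix}^{-1}$. Take inverses of the resulting expression.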


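4.13. Show the following if $\mathbf{X}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $|\boldsymbol{\Sigma}| \neq 0$.

(a) Check that $|\boldsymbol{\Sigma}| = |\boldsymbol{\Sigma}_{22}||\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}|$. (Note that $|\boldsymbol{\Sigma}|$ can be factored into
the product of contributions from the marginal and conditional distributions.)

(b) Check that

\[
(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = [\mathbf{x}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)]'
(\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1}[\mathbf{x}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)]
+ (\mathbf{x}_2 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)
\]

(Thus, the joint density exponent can be written as the sum of two terms corresponding
to contributions from the conditional and marginal distributions.)

(c) Given the results in Parts a and b, identify the marginal distribution of $\mathbf{X}_2$ and the
conditional distribution of $\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2$.

Hint:
(a) Apply Exercise 4.11.
(b) Note from Exercise 4.12 that we can write $(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})$ as

\[
\begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}'
\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} & \mathbf{I} \end{bmatrix}
\begin{bmatrix} (\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22}^{-1} \end{bmatrix}
\begin{bmatrix} \mathbf{I} & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}
\begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}
\]

If we group the product so that

\[
\begin{bmatrix} \mathbf{I} & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}
\begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}
=
\begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}
\]

the result follows.

4.14. If $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $|\boldsymbol{\Sigma}| \neq 0$, show that the joint density can be written
as the product of marginal densities for

\[
\underset{(q \times 1)}{\mathbf{X}_1} \quad \text{and} \quad \underset{((p-q) \times 1)}{\mathbf{X}_2} \quad \text{if} \quad \underset{(q \times (p-q))}{\boldsymbol{\Sigma}_{12}} = \mathbf{0}
\]

Hint: Show by block multiplication that

\[
\begin{bmatrix} \boldsymbol{\Sigma}_{11}^{-1} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22}^{-1} \end{bmatrix}
\text{ is the inverse of }
\boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22} \end{bmatrix}
\]

Then write

\[
(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) =
[(\mathbf{x}_1 - \boldsymbol{\mu}_1)', (\mathbf{x}_2 - \boldsymbol{\mu}_2)']
\begin{bmatrix} \boldsymbol{\Sigma}_{11}^{-1} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22}^{-1} \end{bmatrix}
\begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}
= (\mathbf{x}_1 - \boldsymbol{\mu}_1)'\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1) + (\mathbf{x}_2 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)
\]

Note that $|\boldsymbol{\Sigma}| = |\boldsymbol{\Sigma}_{11}||\boldsymbol{\Sigma}_{22}|$ from Exercise 4.10(a). Now factor the joint density.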


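4.15. Show that $\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}} - \boldsymbol{\mu})'$ and $\sum_{j=1}^{n} (\bar{\mathbf{x}} - \boldsymbol{\mu})(\mathbf{x}_j - \bar{\mathbf{x}})'$ are both $p \times p$ matrices of
zeros. Here $\mathbf{x}_j' = [x_{j1}, x_{j2}, \ldots, x_{jp}]$, $j = 1, 2, \ldots, n$, and

\[
\bar{\mathbf{x}} = \frac{1}{n}\sum_{j=1}^{n} \mathbf{x}_j
\]

4.16. Let $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, and $\mathbf{X}_4$ be independent $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ random vectors.
(a) Find the marginal distributions for each of the random vectors

\[
\mathbf{V}_1 = \tfrac{1}{4}\mathbf{X}_1 - \tfrac{1}{4}\mathbf{X}_2 + \tfrac{1}{4}\mathbf{X}_3 - \tfrac{1}{4}\mathbf{X}_4
\]
and
\[
\mathbf{V}_2 = \tfrac{1}{4}\mathbf{X}_1 + \tfrac{1}{4}\mathbf{X}_2 - \tfrac{1}{4}\mathbf{X}_3 - \tfrac{1}{4}\mathbf{X}_4
\]

(b) Find the joint density of the random vectors $\mathbf{V}_1$ and $\mathbf{V}_2$ defined in (a).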


4.17. Let $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, $\mathbf{X}_4$, and $\mathbf{X}_5$ be independent and identically distributed random vectors
with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Find the mean vector and covariance ma-
trices for each of the two linear combinations of random vectors

\[
\tfrac{1}{5}\mathbf{X}_1 + \tfrac{1}{5}\mathbf{X}_2 + \tfrac{1}{5}\mathbf{X}_3 + \tfrac{1}{5}\mathbf{X}_4 + \tfrac{1}{5}\mathbf{X}_5
\]
and
\[
\mathbf{X}_1 - \mathbf{X}_2 + \mathbf{X}_3 - \mathbf{X}_4 + \mathbf{X}_5
\]

in terms of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. Also, obtain the covariance between the two linear combinations of
random vectors.

4.18. Find the maximum likelihood estimates of the $2 \times 1$ mean vector $\boldsymbol{\mu}$ and the $2 \times 2$
covariance matrix $\boldsymbol{\Sigma}$ based on the random sample

\[
\mathbf{X} = \begin{bmatrix} 3 & 6 \\ 4 & 4 \\ 5 & 7 \\ 4 & 7 \end{bmatrix}
\]

from a bivariate normal population.

4.19. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{20}$ be a random sample of size $n = 20$ from an $N_6(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population.
Specify each of the following completely.
(a) The distribution of $(\mathbf{X}_1 - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X}_1 - \boldsymbol{\mu})$
(b) The distributions of $\bar{\mathbf{X}}$ and $\sqrt{n}(\bar{\mathbf{X}} - \boldsymbol{\mu})$
(c) The distribution of $(n - 1)\mathbf{S}$

4.20. For the random variables $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{20}$ in Exercise 4.19, specify the distribution of
$\mathbf{B}(19\mathbf{S})\mathbf{B}'$ in each case.

(a) $\mathbf{B} = \begin{bmatrix} 1 & -\tfrac{1}{2} & -\tfrac{1}{2} & 0 & 0 & 0 \\ 0 & 0 & 0 & -\tfrac{1}{2} & -\tfrac{1}{2} & 1 \end{bmatrix}$

(b) $\mathbf{B} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \end{bmatrix}$

4.21. Let $\mathbf{X}_1, \ldots, \mathbf{X}_{60}$ be a random sample of size 60 from a four-variate normal distribution
having mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. Specify each of the following completely.
(a) The distribution of $\bar{\mathbf{X}}$
(b) The distribution of $(\mathbf{X}_1 - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X}_1 - \boldsymbol{\mu})$
(c) The distribution of $n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$
(d) The approximate distribution of $n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$

4.22. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{75}$ be a random sample from a population distribution with mean $\boldsymbol{\mu}$
and covariance matrix $\boldsymbol{\Sigma}$. What is the approximate distribution of each of the following?
(a) $\bar{\mathbf{X}}$
(b) $n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$

4.23. Consider the annual rates of return (including dividends) on the Dow-Jones
industrial average for the years 1996-2005. These data, multiplied by 100, are

-0.6 3.1 25.3 -16.8 -7.1 -6.2 25.2 22.6 26.0

Use these 10 observations to complete the following.
(a) Construct a Q-Q plot. Do the data seem to be normally distributed? Explain.
(b) Carry out a test of normality based on the correlation coefficient $r_Q$. [See (4-31).]
Let the significance level be $\alpha = .10$.

4.24. Exercise 1.4 contains data on three variables for the world’s 10 largest companies as of
April 2005. For the sales ($x_1$) and profits ($x_2$) data:
(a) Construct Q-Q plots. Do these data appear to be normally distributed? Explain.
(b) Carry out a test of normality based on the correlation coefficient $r_Q$. [See (4-31).]
Set the significance level at $\alpha = .10$. Do the results of these tests corroborate the re-
sults in Part a?

4.25. Refer to the data for the world’s 10 largest companies in Exercise 1.4. Construct a chi-
square plot using all three variables. The chi-square quantiles are

0.3518 0.7978 1.2125 1.6416 2.1095 2.6430 3.2831 4.1083 5.3170 7.8147

4.26. Exercise 1.2 gives the age $x_1$, measured in years, as well as the selling price $x_2$, measured
in thousands of dollars, for $n = 10$ used cars. These data are reproduced as follows:

$x_1$      1      2      3      3      4      5      6      8      9     11
$x_2$  18.95  19.00  17.95  15.54  14.00  12.95   8.94   7.49   6.00   3.99

(a) Use the results of Exercise 1.2 to calculate the squared statistical distances
$(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}})$, $j = 1, 2, \ldots, 10$, where $\mathbf{x}_j' = [x_{j1}, x_{j2}]$.
(b) Using the distances in Part a, determine the proportion of the observations falling
within the estimated 50% probability contour of a bivariate normal distribution.
(c) Order the distances in Part a and construct a chi-square plot.
(d) Given the results in Parts b and c, are these data approximately bivariate normal?
Explain.

4.27. Consider the radiation data (with door closed) in Example 4.10. Construct a Q-Q plot
for the natural logarithms of these data. [Note that the natural logarithm transformation
corresponds to the value $\lambda = 0$ in (4-34).] Do the natural logarithms appear to be nor-
mally distributed? Compare your results with Figure 4.13. Does the choice $\lambda = \tfrac{1}{4}$ or
$\lambda = 0$ make much difference in this case?

The following exercises may require a computer.

4.28. Consider the air-pollution data given in Table 1.5. Construct a Q-Q plot for the solar
radiation measurements and carry out a test for normality based on the correlation
coefficient $r_Q$ [see (4-31)]. Let $\alpha = .05$ and use the entry corresponding to $n = 40$ in
Table 4.2.

4.29. Given the air-pollution data in Table 1.5, examine the pairs $X_5 = \mathrm{NO}_2$ and $X_6 = \mathrm{O}_3$ for
bivariate normality.
(a) Calculate statistical distances $(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}})$, $j = 1, 2, \ldots, 42$, where
$\mathbf{x}_j' = [x_{j5}, x_{j6}]$.
(b) Determine the proportion of observations $\mathbf{x}_j' = [x_{j5}, x_{j6}]$, $j = 1, 2, \ldots, 42$, falling
within the approximate 50% probability contour of a bivariate normal distribution.
(c) Construct a chi-square plot of the ordered distances in Part a.

4.30. Consider the used-car data in Exercise 4.26.
(a) Determine the power transformation $\hat{\lambda}_1$ that makes the $x_1$ values approximately
normal. Construct a Q-Q plot for the transformed data.
(b) Determine the power transformation $\hat{\lambda}_2$ that makes the $x_2$ values approximately
normal. Construct a Q-Q plot for the transformed data.
(c) Determine the power transformations $\hat{\boldsymbol{\lambda}}' = [\hat{\lambda}_1, \hat{\lambda}_2]$ that make the $[x_1, x_2]$ values
jointly normal using (4-40). Compare the results with those obtained in Parts a and b.


4.31. Examine the marginal normality of the observations on variables $X_1, X_2, \ldots, X_5$ for the
multiple-sclerosis data in Table 1.6. Treat the non-multiple-sclerosis and multiple-sclerosis
groups separately. Use whatever methodology, including transformations, you feel is
appropriate.

4.32. Examine the marginal normality of the observations on variables $X_1, X_2, \ldots, X_6$ for the
radiotherapy data in Table 1.7. Use whatever methodology, including transformations,
you feel is appropriate.

4.33. Examine the marginal and bivariate normality of the observations on variables
$X_1$, $X_2$, $X_3$, and $X_4$ for the data in Table 4.3.

4.34. Examine the data on bone mineral content in Table 1.8 for marginal and bivariate nor-
mality.

4.35. Examine the data on paper-quality measurements in Table 1.2 for marginal and multi-
variate normality.

4.36. Examine the data on women’s national track records in Table 1.9 for marginal and mul-
tivariate normality.

4.37. Refer to Exercise 1.18. Convert the women’s track records in Table 1.9 to speeds mea-
sured in meters per second. Examine the data on speeds for marginal and multivariate
normality.

4.38. Examine the data on bulls in Table 1.10 for marginal and multivariate normality. Consider
only the variables YrHgt, FtFrBody, PrctFFB, BkFat, SaleHt, and SaleWt.

4.39. The data in Table 4.6 (see the psychological profile data: www.prenhall.com/statistics) con-
sist of 130 observations generated by scores on a psychological test administered to Peru-
vian teenagers (ages 15, 16, and 17). For each of these teenagers the gender (male = 1,
female = 2) and socioeconomic status (low = 1, medium = 2) were also recorded. The
scores were accumulated into five subscale scores labeled independence (indep), support
(supp), benevolence (benev), conformity (conform), and leadership (leader).

Table 4.6 Psychological Profile Data
Indep Supp Benev Conform Leader Gender Socio
27 13 14 20 11 2 1
12 13 24 25 6 2 1
14 20 15 16 7 2 1
18 20 17 12 6 2 1
3 22 24 6 2 1
10 1 26 17 10 1 2
14 12 14 11 29 1 2
19 11 23 18 13 2 2
27 19 22 7 9 2 2
10 17 22 22 8 2 2
Source: Data courtesy of C. Soto.

(a) Examine each of the variables independence, support, benevolence, conformity and
leadership for marginal normality.
(b) Using all five variables, check for multivariate normality.
(c) Refer to part (a). For those variables that are nonnormal, determine the transformation
that makes them more nearly normal.




4.40. Consider the data on national parks in Exercise 1.27.
(a) Comment on any possible outliers in a scatter plot of the original variables.
(b) Determine the power transformation $\hat{\lambda}_1$ that makes the $x_1$ values approximately
normal. Construct a Q-Q plot of the transformed observations.
(c) Determine the power transformation $\hat{\lambda}_2$ that makes the $x_2$ values approximately
normal. Construct a Q-Q plot of the transformed observations.
(d) Determine the power transformation for approximate bivariate normality using
(4-40).

4.41. Consider the data on snow removal in Exercise 3.20.
(a) Comment on any possible outliers in a scatter plot of the original variables.
(b) Determine the power transformation $\hat{\lambda}_1$ that makes the $x_1$ values approximately
normal. Construct a Q-Q plot of the transformed observations.
(c) Determine the power transformation $\hat{\lambda}_2$ that makes the $x_2$ values approximately
normal. Construct a Q-Q plot of the transformed observations.
(d) Determine the power transformation for approximate bivariate normality using
(4-40).

References

1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York:
John Wiley, 2003.

2. Andrews, D. F., R. Gnanadesikan, and J. L. Warner. “Transformations of Multivariate
Data.” Biometrics, 27, no. 4 (1971), 825-840.

3. Box, G. E. P., and D. R. Cox. “An Analysis of Transformations” (with discussion). Journal
of the Royal Statistical Society (B), 26, no. 2 (1964), 211-252.

4. Daniel, C., and F. S. Wood. Fitting Equations to Data: Computer Analysis of Multifactor
Data. New York: John Wiley, 1980.

5. Filliben, J. J. “The Probability Plot Correlation Coefficient Test for Normality.”
Technometrics, 17, no. 1 (1975), 111-117.

6. Gnanadesikan, R. Methods for Statistical Data Analysis of Multivariate Observations
(2nd ed.). New York: Wiley-Interscience, 1977.

7. Hawkins, D. M. Identification of Outliers. London, UK: Chapman and Hall, 1980.

8. Hernandez, F., and R. A. Johnson. “The Large-Sample Behavior of Transformations to
Normality.” Journal of the American Statistical Association, 75, no. 372 (1980), 855-861.

9. Hogg, R. V., A. T. Craig, and J. W. McKean. Introduction to Mathematical Statistics (6th
ed.). Upper Saddle River, N.J.: Prentice Hall, 2004.

10. Looney, S. W., and T. R. Gulledge, Jr. “Use of the Correlation Coefficient with Normal
Probability Plots.” The American Statistician, 39, no. 1 (1985), 75-79.

11. Mardia, K. V., J. T. Kent, and J. M. Bibby. Multivariate Analysis (Paperback). London:
Academic Press, 2003.

12. Shapiro, S. S., and M. B. Wilk. “An Analysis of Variance Test for Normality (Complete
Samples).” Biometrika, 52, no. 4 (1965), 591-611.

13. Verrill, S., and R. A. Johnson. “Tables and Large-Sample Distribution Theory for
Censored-Data Correlation Statistics for Testing Normality.” Journal of the American
Statistical Association, 83, no. 404 (1988), 1192-1197.

14. Yeo, I., and R. A. Johnson. “A New Family of Power Transformations to Improve Normal-
ity or Symmetry.” Biometrika, 87, no. 4 (2000), 954-959.

15. Zehna, P. “Invariance of Maximum Likelihood Estimators.” Annals of Mathematical
Statistics, 37, no. 3 (1966), 744.


Chapter 5


INFERENCES ABOUT A MEAN VECTOR 


5.1 Introduction 


This chapter is the first of the methodological sections of the book. We shall now use 
the concepts and results set forth in Chapters 1 through 4 to develop techniques for 
analyzing data. A large part of any analysis is concerned with inference—that is, 
reaching valid conclusions concerning a population on the basis of information from a 
sample. 

At this point, we shall concentrate on inferences about a population mean 
vector and its component parts. Although we introduce statistical inference through 
initial discussions of tests of hypotheses, our ultimate aim is to present a full statisti- 
cal analysis of the component means based on simultaneous confidence statements. 

One of the central messages of multivariate analysis is that p correlated 
variables must be analyzed jointly. This principle is exemplified by the methods 
presented in this chapter. 


5.2 The Plausibility of $\boldsymbol{\mu}_0$ as a Value for a Normal
Population Mean


Let us start by recalling the univariate theory for determining whether a specific value
$\mu_0$ is a plausible value for the population mean $\mu$. From the point of view of hypothe-
sis testing, this problem can be formulated as a test of the competing hypotheses

\[
H_0\colon \mu = \mu_0 \quad \text{and} \quad H_1\colon \mu \neq \mu_0
\]

Here $H_0$ is the null hypothesis and $H_1$ is the (two-sided) alternative hypothesis. If
$X_1, X_2, \ldots, X_n$ denote a random sample from a normal population, the appropriate
test statistic is

\[
t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}, \quad \text{where} \quad \bar{X} = \frac{1}{n}\sum_{j=1}^{n} X_j \quad \text{and} \quad s^2 = \frac{1}{n - 1}\sum_{j=1}^{n} (X_j - \bar{X})^2
\]

This test statistic has a student’s t-distribution with $n - 1$ degrees of freedom (d.f.).
We reject $H_0$, that $\mu_0$ is a plausible value of $\mu$, if the observed $|t|$ exceeds a specified
percentage point of a t-distribution with $n - 1$ d.f.

Rejecting $H_0$ when $|t|$ is large is equivalent to rejecting $H_0$ if its square,

\[
t^2 = \frac{(\bar{X} - \mu_0)^2}{s^2/n} = n(\bar{X} - \mu_0)(s^2)^{-1}(\bar{X} - \mu_0) \tag{5-1}
\]

is large. The variable $t^2$ in (5-1) is the square of the distance from the sample mean
$\bar{X}$ to the test value $\mu_0$. The units of distance are expressed in terms of $s/\sqrt{n}$, or esti-
mated standard deviations of $\bar{X}$. Once $\bar{X}$ and $s^2$ are observed, the test becomes:
Reject $H_0$ in favor of $H_1$, at significance level $\alpha$, if

\[
n(\bar{x} - \mu_0)(s^2)^{-1}(\bar{x} - \mu_0) > t_{n-1}^2(\alpha/2) \tag{5-2}
\]

where $t_{n-1}(\alpha/2)$ denotes the upper $100(\alpha/2)$th percentile of the t-distribution with
$n - 1$ d.f.

If $H_0$ is not rejected, we conclude that $\mu_0$ is a plausible value for the normal
population mean. Are there other values of $\mu$ which are also consistent with the
data? The answer is yes! In fact, there is always a set of plausible values for a nor-
mal population mean. From the well-known correspondence between acceptance
regions for tests of $H_0\colon \mu = \mu_0$ versus $H_1\colon \mu \neq \mu_0$ and confidence intervals for $\mu$,
we have

\[
\{\text{Do not reject } H_0\colon \mu = \mu_0 \text{ at level } \alpha\} \quad \text{or} \quad \left|\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right| \le t_{n-1}(\alpha/2)
\]

is equivalent to

\[
\left\{\mu_0 \text{ lies in the } 100(1 - \alpha)\% \text{ confidence interval } \bar{x} \pm t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}}\right\}
\]

or

\[
\bar{x} - t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}} \le \mu_0 \le \bar{x} + t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}} \tag{5-3}
\]


The confidence interval consists of all those values $\mu_0$ that would not be rejected by
the level $\alpha$ test of $H_0\colon \mu = \mu_0$.

Before the sample is selected, the $100(1 - \alpha)\%$ confidence interval in (5-3) is a
random interval because the endpoints depend upon the random variables $\bar{X}$ and $s$.
The probability that the interval contains $\mu$ is $1 - \alpha$; among large numbers of such
independent intervals, approximately $100(1 - \alpha)\%$ of them will contain $\mu$.
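The correspondence between the test in (5-2) and the interval in (5-3) can be checked numerically. The sketch below is an illustration under stated assumptions: the sample is hypothetical, the function names are introduced here for convenience, and the critical value $t_9(.025) = 2.262$ is entered as a constant from a standard t table rather than computed.

```python
import math


def t_squared(xs, mu0):
    """t^2 = n (xbar - mu0)^2 / s^2, as in (5-1)."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return n * (xbar - mu0) ** 2 / s2


def confidence_interval(xs, t_crit):
    """Interval xbar -/+ t_crit * s / sqrt(n), as in (5-3)."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    half_width = t_crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical sample of n = 10 observations, used only for illustration.
sample = [4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 4.7, 5.2, 5.5]
T_CRIT = 2.262  # upper 2.5% point of the t-distribution with 9 d.f. (alpha = .05)

lower, upper = confidence_interval(sample, T_CRIT)
for mu0 in (4.5, 5.0, 5.2, 5.3, 5.9):
    rejected = t_squared(sample, mu0) > T_CRIT ** 2
    inside = lower <= mu0 <= upper
    print(mu0, "rejected" if rejected else "not rejected", "inside" if inside else "outside")
```

Each candidate value is rejected exactly when it falls outside the interval, which is the equivalence stated in the text.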

Consider now the problem of determining whether a given $p \times 1$ vector $\boldsymbol{\mu}_0$ is a
plausible value for the mean of a multivariate normal distribution. We shall proceed
by analogy to the univariate development just presented.

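A natural generalization of the squared distance in (5-1) is its multivariate analog

\[
T^2 = (\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\left(\frac{1}{n}\mathbf{S}\right)^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0) = n(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0) \tag{5-4}
\]

where

\[
\underset{(p \times 1)}{\bar{\mathbf{X}}} = \frac{1}{n}\sum_{j=1}^{n} \mathbf{X}_j, \qquad
\underset{(p \times p)}{\mathbf{S}} = \frac{1}{n - 1}\sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})', \qquad \text{and} \qquad
\underset{(p \times 1)}{\boldsymbol{\mu}_0} = \begin{bmatrix} \mu_{10} \\ \mu_{20} \\ \vdots \\ \mu_{p0} \end{bmatrix}
\]

The statistic $T^2$ is called Hotelling’s $T^2$ in honor of Harold Hotelling, a pioneer in
multivariate analysis, who first obtained its sampling distribution. Here $(1/n)\mathbf{S}$ is the
estimated covariance matrix of $\bar{\mathbf{X}}$. (See Result 3.1.)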

If the observed statistical distance $T^2$ is too large (that is, if $\bar{\mathbf{x}}$ is “too far” from
$\boldsymbol{\mu}_0$), the hypothesis $H_0\colon \boldsymbol{\mu} = \boldsymbol{\mu}_0$ is rejected. It turns out that special tables of $T^2$ per-
centage points are not required for formal tests of hypotheses. This is true because

\[
T^2 \text{ is distributed as } \frac{(n - 1)p}{(n - p)} F_{p,\,n-p} \tag{5-5}
\]

where $F_{p,\,n-p}$ denotes a random variable with an F-distribution with $p$ and $n - p$ d.f.

To summarize, we have the following:


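Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from an $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population. Then
with $\bar{\mathbf{X}} = \dfrac{1}{n}\sum_{j=1}^{n} \mathbf{X}_j$ and $\mathbf{S} = \dfrac{1}{(n - 1)}\sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})'$,

\[
\alpha = P\left[T^2 > \frac{(n - 1)p}{(n - p)} F_{p,\,n-p}(\alpha)\right]
= P\left[n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}) > \frac{(n - 1)p}{(n - p)} F_{p,\,n-p}(\alpha)\right] \tag{5-6}
\]

whatever the true $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. Here $F_{p,\,n-p}(\alpha)$ is the upper $(100\alpha)$th percentile of
the $F_{p,\,n-p}$ distribution.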

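Statement (5-6) leads immediately to a test of the hypothesis $H_0\colon \boldsymbol{\mu} = \boldsymbol{\mu}_0$ versus
$H_1\colon \boldsymbol{\mu} \neq \boldsymbol{\mu}_0$. At the $\alpha$ level of significance, we reject $H_0$ in favor of $H_1$ if the
observed

\[
T^2 = n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) > \frac{(n - 1)p}{(n - p)} F_{p,\,n-p}(\alpha) \tag{5-7}
\]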

It is informative to discuss the nature of the T?-distribution briefly and its cor- 
respondence with the univariate test statistic. In Section 4.4, we described the man- 
ner in which the Wishart distribution generalizes the chi-square distribution. We 
can write 


(x) - Xx) - &)’ 7 
= Vn (X — po)’ |} vn ( ~ pm») 


n-1 


The Plausibility of 4) as a Value for a Normal Population Mean 213 


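which combines a normal, $N_p(\mathbf{0}, \boldsymbol{\Sigma})$, random vector and a Wishart, $W_{p,n-1}(\boldsymbol{\Sigma})$, random
matrix in the form

$$T^2_{p,n-1} = \begin{pmatrix}\text{multivariate normal}\\ \text{random vector}\end{pmatrix}'\begin{pmatrix}\dfrac{\text{Wishart random matrix}}{\text{d.f.}}\end{pmatrix}^{-1}\begin{pmatrix}\text{multivariate normal}\\ \text{random vector}\end{pmatrix}$$

$$= N_p(\mathbf{0}, \boldsymbol{\Sigma})'\left[\frac{1}{n-1}W_{p,n-1}(\boldsymbol{\Sigma})\right]^{-1}N_p(\mathbf{0}, \boldsymbol{\Sigma}) \tag{5-8}$$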


This is analogous to 
$$t^2 = \sqrt{n}\,(\bar{X} - \mu_0)(s^2)^{-1}\sqrt{n}\,(\bar{X} - \mu_0)$$

or

$$t^2_{n-1} = \begin{pmatrix}\text{normal}\\ \text{random variable}\end{pmatrix}\begin{pmatrix}\dfrac{\text{(scaled) chi-square random variable}}{\text{d.f.}}\end{pmatrix}^{-1}\begin{pmatrix}\text{normal}\\ \text{random variable}\end{pmatrix}$$

for the univariate case. Since the multivariate normal and Wishart random variables
are independently distributed [see (4-23)], their joint density function is the product
of the marginal normal and Wishart distributions. Using calculus, the distribution
(5-5) of $T^2$ as given previously can be derived from this joint distribution and the
representation (5-8).

It is rare, in multivariate situations, to be content with a test of $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$,
where all of the mean vector components are specified under the null hypothesis.
Ordinarily, it is preferable to find regions of $\boldsymbol{\mu}$ values that are plausible in light of
the observed data. We shall return to this issue in Section 5.4.


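Example 5.1 (Evaluating $T^2$) Let the data matrix for a random sample of size $n = 3$
from a bivariate normal population be

$$\mathbf{X} = \begin{bmatrix} 6 & 9 \\ 10 & 6 \\ 8 & 3 \end{bmatrix}$$

Evaluate the observed $T^2$ for $\boldsymbol{\mu}_0' = [9, 5]$. What is the sampling distribution of $T^2$ in
this case? We find

$$\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \end{bmatrix} = \begin{bmatrix} \dfrac{6 + 10 + 8}{3} \\[2mm] \dfrac{9 + 6 + 3}{3} \end{bmatrix} = \begin{bmatrix} 8 \\ 6 \end{bmatrix}$$

and

$$s_{11} = \frac{(6 - 8)^2 + (10 - 8)^2 + (8 - 8)^2}{2} = 4$$

$$s_{12} = \frac{(6 - 8)(9 - 6) + (10 - 8)(6 - 6) + (8 - 8)(3 - 6)}{2} = -3$$

$$s_{22} = \frac{(9 - 6)^2 + (6 - 6)^2 + (3 - 6)^2}{2} = 9$$

so

$$\mathbf{S} = \begin{bmatrix} 4 & -3 \\ -3 & 9 \end{bmatrix}$$

Thus,

$$\mathbf{S}^{-1} = \frac{1}{(4)(9) - (-3)(-3)}\begin{bmatrix} 9 & 3 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} \frac{1}{3} & \frac{1}{9} \\[1mm] \frac{1}{9} & \frac{4}{27} \end{bmatrix}$$

and, from (5-4),

$$T^2 = 3\,[8 - 9,\ 6 - 5]\begin{bmatrix} \frac{1}{3} & \frac{1}{9} \\[1mm] \frac{1}{9} & \frac{4}{27} \end{bmatrix}\begin{bmatrix} 8 - 9 \\ 6 - 5 \end{bmatrix} = 3\,[-1,\ 1]\begin{bmatrix} -\frac{2}{9} \\[1mm] \frac{1}{27} \end{bmatrix} = \frac{7}{9}$$

Before the sample is selected, $T^2$ has the distribution of a

$$\frac{(3 - 1)2}{(3 - 2)}F_{2,3-2} = 4F_{2,1}$$

random variable. ■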
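The arithmetic of Example 5.1 is easy to check numerically. The following is a minimal sketch in Python, assuming NumPy is available; the function name `hotelling_t2` is our own label and is not part of the text.

```python
import numpy as np

def hotelling_t2(X, mu0):
    """Return T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0) for an n x p data matrix X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    xbar = X.mean(axis=0)                        # sample mean vector
    S = np.atleast_2d(np.cov(X, rowvar=False))   # sample covariance, divisor n - 1
    d = xbar - np.asarray(mu0, dtype=float)
    # Solve S z = d instead of forming S^{-1} explicitly.
    return n * d @ np.linalg.solve(S, d)

# Data matrix and hypothesized mean from Example 5.1.
X = [[6, 9], [10, 6], [8, 3]]
t2 = hotelling_t2(X, [9, 5])
print(t2)  # approximately 0.7778, that is, 7/9
```

Solving the linear system rather than inverting $\mathbf{S}$ is only a numerical convenience; the quantity computed is the one in (5-4).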


The next example illustrates a test of the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ using data
collected as part of a search for new diagnostic techniques at the University of
Wisconsin Medical School.


Example 5.2 (Testing a multivariate mean vector with $T^2$) Perspiration from 20
healthy females was analyzed. Three components, $X_1$ = sweat rate, $X_2$ = sodium
content, and $X_3$ = potassium content, were measured, and the results, which we call
the sweat data, are presented in Table 5.1.

Test the hypothesis $H_0: \boldsymbol{\mu}' = [4, 50, 10]$ against $H_1: \boldsymbol{\mu}' \neq [4, 50, 10]$ at level of
significance $\alpha = .10$.

Computer calculations provide

$$\bar{\mathbf{x}} = \begin{bmatrix} 4.640 \\ 45.400 \\ 9.965 \end{bmatrix}, \qquad \mathbf{S} = \begin{bmatrix} 2.879 & 10.010 & -1.810 \\ 10.010 & 199.788 & -5.640 \\ -1.810 & -5.640 & 3.628 \end{bmatrix}$$

and

$$\mathbf{S}^{-1} = \begin{bmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{bmatrix}$$

We evaluate

$$T^2 = 20\,[4.640 - 4,\ 45.400 - 50,\ 9.965 - 10]\begin{bmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{bmatrix}\begin{bmatrix} 4.640 - 4 \\ 45.400 - 50 \\ 9.965 - 10 \end{bmatrix}$$

$$= 20\,[.640,\ -4.600,\ -.035]\begin{bmatrix} .467 \\ -.042 \\ .160 \end{bmatrix} = 9.74$$




Table 5.1 Sweat Data

                 X1            X2         X3
Individual  (Sweat rate)   (Sodium)   (Potassium)
     1           3.7          48.5        9.3
     2           5.7          65.1        8.0
     3           3.8          47.2       10.9
     4           3.2          53.2       12.0
     5           3.1          55.5        9.7
     6           4.6          36.1        7.9
     7           2.4          24.8       14.0
     8           7.2          33.1        7.6
     9           6.7          47.4        8.5
    10           5.4          54.1       11.3
    11           3.9          36.9       12.7
    12           4.5          58.8       12.3
    13           3.5          27.8        9.8
    14           4.5          40.2        8.4
    15           1.5          13.5       10.1
    16           8.5          56.4        7.1
    17           4.5          71.6        8.2
    18           6.5          52.8       10.9
    19           4.1          44.1       11.2
    20           5.5          40.9        9.4

Source: Courtesy of Dr. Gerald Bargman.


Comparing the observed $T^2 = 9.74$ with the critical value

$$\frac{(n-1)p}{(n-p)}F_{p,n-p}(.10) = \frac{19(3)}{17}F_{3,17}(.10) = 3.353(2.44) = 8.18$$

we see that $T^2 = 9.74 > 8.18$, and consequently, we reject $H_0$ at the 10% level of
significance.

We note that $H_0$ will be rejected if one or more of the component means, or
some combination of means, differs too much from the hypothesized values
[4, 50, 10]. At this point, we have no idea which of these hypothesized values may
not be supported by the data.

We have assumed that the sweat data are multivariate normal. The Q-Q plots
constructed from the marginal distributions of $X_1$, $X_2$, and $X_3$ all approximate
straight lines. Moreover, scatter plots for pairs of observations have approximate
elliptical shapes, and we conclude that the normality assumption was reasonable in
this case. (See Exercise 5.4.) ■
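The test in Example 5.2 can be reproduced directly from the sweat data of Table 5.1. The sketch below assumes NumPy is available; the percentile $F_{3,17}(.10) = 2.44$ is taken from the text rather than computed, and the variable names are ours.

```python
import numpy as np

# Table 5.1 sweat data: columns are sweat rate, sodium, potassium.
sweat = np.array([
    [3.7, 48.5, 9.3], [5.7, 65.1, 8.0], [3.8, 47.2, 10.9], [3.2, 53.2, 12.0],
    [3.1, 55.5, 9.7], [4.6, 36.1, 7.9], [2.4, 24.8, 14.0], [7.2, 33.1, 7.6],
    [6.7, 47.4, 8.5], [5.4, 54.1, 11.3], [3.9, 36.9, 12.7], [4.5, 58.8, 12.3],
    [3.5, 27.8, 9.8], [4.5, 40.2, 8.4], [1.5, 13.5, 10.1], [8.5, 56.4, 7.1],
    [4.5, 71.6, 8.2], [6.5, 52.8, 10.9], [4.1, 44.1, 11.2], [5.5, 40.9, 9.4],
])
mu0 = np.array([4.0, 50.0, 10.0])       # hypothesized mean vector

n, p = sweat.shape
xbar = sweat.mean(axis=0)               # sample mean vector
S = np.cov(sweat, rowvar=False)         # sample covariance, divisor n - 1
d = xbar - mu0
T2 = n * d @ np.linalg.solve(S, d)      # statistic in (5-7)

F_crit = 2.44                           # F_{3,17}(.10), value quoted in the text
crit = (n - 1) * p / (n - p) * F_crit   # critical value in (5-7)
reject = T2 > crit

print(np.round(xbar, 3))                # close to [4.640, 45.400, 9.965]
print(round(T2, 2), round(crit, 2), reject)
```

Because the code works with unrounded $\mathbf{S}$, the computed $T^2$ may differ from 9.74 in the later decimal places, but the decision to reject $H_0$ at the 10% level is the same.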


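One feature of the $T^2$-statistic is that it is invariant (unchanged) under changes
in the units of measurements for $\mathbf{X}$ of the form

$$\underset{(p \times 1)}{\mathbf{Y}} = \underset{(p \times p)}{\mathbf{C}}\ \underset{(p \times 1)}{\mathbf{X}} + \underset{(p \times 1)}{\mathbf{d}}, \qquad \mathbf{C} \text{ nonsingular} \tag{5-9}$$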


A transformation of the observations of this kind arises when a constant $b_i$ is
subtracted from the $i$th variable to form $X_i - b_i$ and the result is multiplied
by a constant $a_i > 0$ to get $a_i(X_i - b_i)$. Premultiplication of the centered and
scaled quantities $a_i(X_i - b_i)$ by any nonsingular matrix will yield Equation (5-9).
As an example, the operations involved in changing $X_i$ to $a_i(X_i - b_i)$ correspond
exactly to the process of converting temperature from a Fahrenheit to a Celsius
reading.

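Given observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ and the transformation in (5-9), it immediately
follows from Result 3.6 that

$$\bar{\mathbf{y}} = \mathbf{C}\bar{\mathbf{x}} + \mathbf{d} \qquad \text{and} \qquad \mathbf{S}_y = \frac{1}{n-1}\sum_{j=1}^{n}(\mathbf{y}_j - \bar{\mathbf{y}})(\mathbf{y}_j - \bar{\mathbf{y}})' = \mathbf{C}\mathbf{S}\mathbf{C}'$$

Moreover, by (2-24) and (2-45),

$$\boldsymbol{\mu}_Y = E(\mathbf{Y}) = E(\mathbf{C}\mathbf{X} + \mathbf{d}) = E(\mathbf{C}\mathbf{X}) + E(\mathbf{d}) = \mathbf{C}\boldsymbol{\mu} + \mathbf{d}$$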


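Therefore, $T^2$ computed with the $\mathbf{y}$'s and a hypothesized value $\boldsymbol{\mu}_{Y,0} = \mathbf{C}\boldsymbol{\mu}_0 + \mathbf{d}$ is

$$\begin{aligned}
T^2 &= n(\bar{\mathbf{y}} - \boldsymbol{\mu}_{Y,0})'\mathbf{S}_y^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_{Y,0}) \\
&= n(\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0))'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)) \\
&= n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{C}'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) \\
&= n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{C}'(\mathbf{C}')^{-1}\mathbf{S}^{-1}\mathbf{C}^{-1}\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) = n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)
\end{aligned}$$

The last expression is recognized as the value of $T^2$ computed with the $\mathbf{x}$'s.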
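The invariance of $T^2$ under (5-9) can also be seen numerically. The sketch below, which assumes NumPy, reuses the data of Example 5.1; the particular nonsingular matrix `C` and shift vector `shift` are arbitrary choices made only for illustration.

```python
import numpy as np

def t2(X, mu0):
    """T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0) for an n x p array X."""
    n = X.shape[0]
    d = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)
    return n * d @ np.linalg.solve(S, d)

X = np.array([[6.0, 9.0], [10.0, 6.0], [8.0, 3.0]])   # data of Example 5.1
mu0 = np.array([9.0, 5.0])

C = np.array([[2.0, 1.0], [0.5, 3.0]])   # arbitrary nonsingular matrix (illustrative)
shift = np.array([-4.0, 7.0])            # arbitrary shift vector d (illustrative)

Y = X @ C.T + shift                      # each row is y_j = C x_j + d
mu_y0 = C @ mu0 + shift                  # transformed hypothesized mean

t2_x = t2(X, mu0)
t2_y = t2(Y, mu_y0)
print(t2_x, t2_y)                        # the two values agree up to rounding error
```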


5.3 Hotelling's $T^2$ and Likelihood Ratio Tests


We introduced the $T^2$-statistic by analogy with the univariate squared distance $t^2$.
There is a general principle for constructing test procedures called the likelihood
ratio method, and the $T^2$-statistic can be derived as the likelihood ratio test of
$H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$. The general theory of likelihood ratio tests is beyond the scope of this
book. (See [3] for a treatment of the topic.) Likelihood ratio tests have several
optimal properties for reasonably large samples, and they are particularly convenient
for hypotheses formulated in terms of multivariate normal parameters.

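We know from (4-18) that the maximum of the multivariate normal likelihood
as $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are varied over their possible values is given by

$$\max_{\boldsymbol{\mu},\boldsymbol{\Sigma}} L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\hat{\boldsymbol{\Sigma}}|^{n/2}}e^{-np/2} \tag{5-10}$$

where

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' \qquad \text{and} \qquad \hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} = \frac{1}{n}\sum_{j=1}^{n}\mathbf{x}_j$$

are the maximum likelihood estimates. Recall that $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ are those choices for $\boldsymbol{\mu}$
and $\boldsymbol{\Sigma}$ that best explain the observed values of the random sample.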




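Under the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$, the normal likelihood specializes to

$$L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\boldsymbol{\Sigma}|^{n/2}}\exp\left(-\frac{1}{2}\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu}_0)\right)$$

The mean $\boldsymbol{\mu}_0$ is now fixed, but $\boldsymbol{\Sigma}$ can be varied to find the value that is "most likely"
to have led, with $\boldsymbol{\mu}_0$ fixed, to the observed sample. This value is obtained by
maximizing $L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})$ with respect to $\boldsymbol{\Sigma}$.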

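Following the steps in (4-13), the exponent in $L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})$ may be written as

$$\begin{aligned}
-\frac{1}{2}\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu}_0) &= -\frac{1}{2}\sum_{j=1}^{n}\operatorname{tr}\left[\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'\right] \\
&= -\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}^{-1}\left(\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'\right)\right]
\end{aligned}$$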


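Applying Result 4.10 with $\mathbf{B} = \sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'$ and $b = n/2$, we have

$$\max_{\boldsymbol{\Sigma}} L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\hat{\boldsymbol{\Sigma}}_0|^{n/2}}e^{-np/2} \tag{5-11}$$

with

$$\hat{\boldsymbol{\Sigma}}_0 = \frac{1}{n}\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'$$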
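To determine whether $\boldsymbol{\mu}_0$ is a plausible value of $\boldsymbol{\mu}$, the maximum of $L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})$ is
compared with the unrestricted maximum of $L(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. The resulting ratio is called
the likelihood ratio statistic.
Using Equations (5-10) and (5-11), we get

$$\text{Likelihood ratio} = \Lambda = \frac{\max\limits_{\boldsymbol{\Sigma}} L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})}{\max\limits_{\boldsymbol{\mu},\boldsymbol{\Sigma}} L(\boldsymbol{\mu}, \boldsymbol{\Sigma})} = \left(\frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|}\right)^{n/2} \tag{5-12}$$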


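The equivalent statistic $\Lambda^{2/n} = |\hat{\boldsymbol{\Sigma}}|/|\hat{\boldsymbol{\Sigma}}_0|$ is called Wilks' lambda. If the
observed value of this likelihood ratio is too small, the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ is
unlikely to be true and is, therefore, rejected. Specifically, the likelihood ratio test of
$H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ against $H_1: \boldsymbol{\mu} \neq \boldsymbol{\mu}_0$ rejects $H_0$ if

$$\Lambda = \left(\frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|}\right)^{n/2} = \left(\frac{\left|\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right|}{\left|\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'\right|}\right)^{n/2} < c_\alpha \tag{5-13}$$

where $c_\alpha$ is the lower $(100\alpha)$th percentile of the distribution of $\Lambda$. (Note that the
likelihood ratio test statistic is a power of the ratio of generalized variances.)
Fortunately, because of the following relation between $T^2$ and $\Lambda$, we do not need the
distribution of the latter to carry out the test.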




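Result 5.1. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from an $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population.
Then the test in (5-7) based on $T^2$ is equivalent to the likelihood ratio test of
$H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ versus $H_1: \boldsymbol{\mu} \neq \boldsymbol{\mu}_0$ because

$$\Lambda^{2/n} = \left(1 + \frac{T^2}{(n-1)}\right)^{-1}$$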

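Proof. Let the $(p + 1) \times (p + 1)$ matrix

$$\mathbf{A} = \begin{bmatrix} \sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' & \sqrt{n}\,(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) \\[1mm] \sqrt{n}\,(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)' & -1 \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}$$

By Exercise 4.11, $|\mathbf{A}| = |\mathbf{A}_{22}|\,|\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}| = |\mathbf{A}_{11}|\,|\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}|$,
from which we obtain

$$(-1)\left|\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' + n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\right|$$

$$= \left|\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right|\,\left|-1 - n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\left(\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right)^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)\right|$$

Since, by (4-14),

$$\begin{aligned}
\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)' &= \sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}} + \bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\mathbf{x}_j - \bar{\mathbf{x}} + \bar{\mathbf{x}} - \boldsymbol{\mu}_0)' \\
&= \sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' + n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'
\end{aligned}$$

the foregoing equality involving determinants can be written

$$(-1)\left|\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'\right| = \left|\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right|(-1)\left(1 + \frac{T^2}{(n-1)}\right)$$

or

$$|n\hat{\boldsymbol{\Sigma}}_0| = |n\hat{\boldsymbol{\Sigma}}|\left(1 + \frac{T^2}{(n-1)}\right)$$

Thus,

$$\Lambda^{2/n} = \frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|} = \left(1 + \frac{T^2}{(n-1)}\right)^{-1} \tag{5-14}$$

Here $H_0$ is rejected for small values of $\Lambda^{2/n}$ or, equivalently, large values of $T^2$. The
critical values of $T^2$ are determined by (5-6). ■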




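Incidentally, relation (5-14) shows that $T^2$ may be calculated from two determinants,
thus avoiding the computation of $\mathbf{S}^{-1}$. Solving (5-14) for $T^2$, we have

$$T^2 = \frac{(n-1)|\hat{\boldsymbol{\Sigma}}_0|}{|\hat{\boldsymbol{\Sigma}}|} - (n-1) = \frac{(n-1)\left|\sum_{j=1}^{n}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'\right|}{\left|\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right|} - (n-1) \tag{5-15}$$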
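As a check on (5-14) and (5-15), the determinant route can be compared with the direct computation of $T^2$. The sketch below assumes NumPy and uses the small data set of Example 5.1; the variable names are ours.

```python
import numpy as np

X = np.array([[6.0, 9.0], [10.0, 6.0], [8.0, 3.0]])   # data of Example 5.1
mu0 = np.array([9.0, 5.0])
n = X.shape[0]
xbar = X.mean(axis=0)

# Sums of squares and cross products about xbar and about mu0.
A = (X - xbar).T @ (X - xbar)     # equals n * Sigma-hat
A0 = (X - mu0).T @ (X - mu0)      # equals n * Sigma-hat_0

wilks = np.linalg.det(A) / np.linalg.det(A0)                       # Lambda^(2/n)
T2_det = (n - 1) * np.linalg.det(A0) / np.linalg.det(A) - (n - 1)  # (5-15)

# Direct computation from (5-4), with S = A / (n - 1).
d = xbar - mu0
T2_direct = n * d @ np.linalg.solve(A / (n - 1), d)

print(wilks, T2_det, T2_direct)
```

For these data the two determinants are 108 and 150, so Wilks' lambda is .72 and both routes give $T^2 = 7/9$.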


Likelihood ratio tests are common in multivariate analysis. Their optimal
large sample properties hold in very general contexts, as we shall indicate shortly.
They are well suited for the testing situations considered in this book. Likelihood
ratio methods yield test statistics that reduce to the familiar F- and t-statistics in
univariate situations.


General Likelihood Ratio Method 


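We shall now consider the general likelihood ratio method. Let $\boldsymbol{\theta}$ be a vector consisting
of all the unknown population parameters, and let $L(\boldsymbol{\theta})$ be the likelihood function
obtained by evaluating the joint density of $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ at their observed values
$\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$. The parameter vector $\boldsymbol{\theta}$ takes its value in the parameter set $\Theta$. For
example, in the $p$-dimensional multivariate normal case, $\boldsymbol{\theta}' = [\mu_1, \ldots, \mu_p,
\sigma_{11}, \ldots, \sigma_{1p}, \sigma_{22}, \ldots, \sigma_{2p}, \ldots, \sigma_{p-1,p}, \sigma_{pp}]$ and $\Theta$ consists of the $p$-dimensional
space, where $-\infty < \mu_1 < \infty, \ldots, -\infty < \mu_p < \infty$, combined with the
$[p(p + 1)/2]$-dimensional space of variances and covariances such that $\boldsymbol{\Sigma}$ is positive
definite. Therefore, $\Theta$ has dimension $\nu = p + p(p + 1)/2$. Under the null hypothesis
$H_0: \boldsymbol{\theta} = \boldsymbol{\theta}_0$, $\boldsymbol{\theta}$ is restricted to lie in a subset $\Theta_0$ of $\Theta$. For the multivariate normal
situation with $\boldsymbol{\mu} = \boldsymbol{\mu}_0$ and $\boldsymbol{\Sigma}$ unspecified, $\Theta_0 = \{\mu_1 = \mu_{10}, \mu_2 = \mu_{20}, \ldots, \mu_p = \mu_{p0};
\sigma_{11}, \ldots, \sigma_{1p}, \sigma_{22}, \ldots, \sigma_{2p}, \ldots, \sigma_{p-1,p}, \sigma_{pp}$ with $\boldsymbol{\Sigma}$ positive definite$\}$, so $\Theta_0$ has
dimension $\nu_0 = 0 + p(p + 1)/2 = p(p + 1)/2$.
A likelihood ratio test of $H_0: \boldsymbol{\theta} \in \Theta_0$ rejects $H_0$ in favor of $H_1: \boldsymbol{\theta} \notin \Theta_0$ if

$$\Lambda = \frac{\max\limits_{\boldsymbol{\theta} \in \Theta_0} L(\boldsymbol{\theta})}{\max\limits_{\boldsymbol{\theta} \in \Theta} L(\boldsymbol{\theta})} < c \tag{5-16}$$

where $c$ is a suitably chosen constant. Intuitively, we reject $H_0$ if the maximum of the
likelihood obtained by allowing $\boldsymbol{\theta}$ to vary over the set $\Theta_0$ is much smaller than
the maximum of the likelihood obtained by varying $\boldsymbol{\theta}$ over all values in $\Theta$. When the
maximum in the numerator of expression (5-16) is much smaller than the maximum
in the denominator, $\Theta_0$ does not contain plausible values for $\boldsymbol{\theta}$.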

In each application of the likelihood ratio method, we must obtain the sampling
distribution of the likelihood-ratio test statistic $\Lambda$. Then $c$ can be selected to produce
a test with a specified significance level $\alpha$. However, when the sample size is large
and certain regularity conditions are satisfied, the sampling distribution of $-2\ln\Lambda$
is well approximated by a chi-square distribution. This attractive feature accounts, in
part, for the popularity of likelihood ratio procedures.




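Result 5.2. When the sample size $n$ is large, under the null hypothesis $H_0$,

$$-2\ln\Lambda = -2\ln\left(\frac{\max\limits_{\boldsymbol{\theta} \in \Theta_0} L(\boldsymbol{\theta})}{\max\limits_{\boldsymbol{\theta} \in \Theta} L(\boldsymbol{\theta})}\right)$$

is, approximately, a $\chi^2_{\nu - \nu_0}$ random variable. Here the degrees of freedom are $\nu - \nu_0$
= (dimension of $\Theta$) − (dimension of $\Theta_0$). ■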
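To see how Result 5.2 might be used for the test of $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$, note that (5-14) gives $-2\ln\Lambda = n\ln(1 + T^2/(n-1))$, and that $\nu - \nu_0 = p$ in this case. The sketch below applies this to the sweat data summary of Example 5.2 ($n = 20$, $p = 3$, $T^2 = 9.74$). It is purely illustrative, since $n = 20$ is hardly large; the chi-square percentile 6.25 is a standard table value, not a number taken from the text.

```python
import math

n, p = 20, 3
T2 = 9.74                      # observed T^2 from Example 5.2

# From (5-14): Lambda = (1 + T^2/(n-1))^(-n/2), so
# -2 ln(Lambda) = n * ln(1 + T^2/(n-1)).
neg2_ln_lambda = n * math.log(1.0 + T2 / (n - 1))

# Degrees of freedom: nu - nu0 = [p + p(p+1)/2] - p(p+1)/2 = p.
df = (p + p * (p + 1) // 2) - p * (p + 1) // 2

chi2_crit = 6.25               # upper 10% point of chi-square with 3 d.f. (table value)
reject_large_sample = neg2_ln_lambda > chi2_crit

print(round(neg2_ln_lambda, 2), df, reject_large_sample)
```

Here the large-sample approximation leads to the same decision as the exact $F$-based test of (5-7), although the two need not agree in small samples.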

Statistical tests are compared on the basis of their power, which is defined as the
curve or surface whose height is $P[\text{test rejects } H_0 \mid \boldsymbol{\theta}]$, evaluated at each parameter
vector $\boldsymbol{\theta}$. Power measures the ability of a test to reject $H_0$ when it is not true. In the
rare situation where $\boldsymbol{\theta} = \boldsymbol{\theta}_0$ is completely specified under $H_0$ and the alternative $H_1$
consists of the single specified value $\boldsymbol{\theta} = \boldsymbol{\theta}_1$, the likelihood ratio test has the highest
power among all tests with the same significance level $\alpha = P[\text{test rejects } H_0 \mid \boldsymbol{\theta} = \boldsymbol{\theta}_0]$.
In many single-parameter cases ($\boldsymbol{\theta}$ has one component), the likelihood ratio test is
uniformly most powerful against all alternatives to one side of $H_0: \theta = \theta_0$. In other
cases, this property holds approximately for large samples.

We shall not give the technical details required for discussing the optimal prop- 
erties of likelihood ratio tests in the multivariate situation. The general import of 
these properties, for our purposes, is that they have the highest possible (average) 
power when the sample size is large. 


5.4 Confidence Regions and Simultaneous Comparisons 
of Component Means 


To obtain our primary method for making inferences from a sample, we need to extend
the concept of a univariate confidence interval to a multivariate confidence region.
Let $\boldsymbol{\theta}$ be a vector of unknown population parameters and $\Theta$ be the set of all
possible values of $\boldsymbol{\theta}$. A confidence region is a region of likely $\boldsymbol{\theta}$ values. This region is
determined by the data, and for the moment, we shall denote it by $R(\mathbf{X})$, where
$\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n]'$ is the data matrix.
The region $R(\mathbf{X})$ is said to be a $100(1 - \alpha)\%$ confidence region if, before the
sample is selected,

$$P[R(\mathbf{X}) \text{ will cover the true } \boldsymbol{\theta}] = 1 - \alpha \tag{5-17}$$


This probability is calculated under the true, but unknown, value of $\boldsymbol{\theta}$.
The confidence region for the mean $\boldsymbol{\mu}$ of a $p$-dimensional normal population is
available from (5-6). Before the sample is selected,

$$P\left[n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}) \le \frac{(n-1)p}{(n-p)}F_{p,n-p}(\alpha)\right] = 1 - \alpha$$

whatever the values of the unknown $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. In words, $\bar{\mathbf{X}}$ will be within

$$[(n-1)pF_{p,n-p}(\alpha)/(n-p)]^{1/2}$$

of $\boldsymbol{\mu}$, with probability $1 - \alpha$, provided that distance is defined in terms of $n\mathbf{S}^{-1}$.
For a particular sample, $\bar{\mathbf{x}}$ and $\mathbf{S}$ can be computed, and the inequality
$n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \le (n-1)pF_{p,n-p}(\alpha)/(n-p)$ will define a region $R(\mathbf{X})$
within the space of all possible parameter values. In this case, the region will be an
ellipsoid centered at $\bar{\mathbf{x}}$. This ellipsoid is the $100(1 - \alpha)\%$ confidence region for $\boldsymbol{\mu}$.


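A $100(1 - \alpha)\%$ confidence region for the mean of a $p$-dimensional normal
distribution is the ellipsoid determined by all $\boldsymbol{\mu}$ such that

$$n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \le \frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha) \tag{5-18}$$

where $\bar{\mathbf{x}} = \frac{1}{n}\sum_{j=1}^{n}\mathbf{x}_j$, $\mathbf{S} = \frac{1}{(n-1)}\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'$ and $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ are
the sample observations.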


To determine whether any $\boldsymbol{\mu}_0$ lies within the confidence region (is a
plausible value for $\boldsymbol{\mu}$), we need to compute the generalized squared distance
$n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)$ and compare it with $[p(n-1)/(n-p)]F_{p,n-p}(\alpha)$. If the
squared distance is larger than $[p(n-1)/(n-p)]F_{p,n-p}(\alpha)$, $\boldsymbol{\mu}_0$ is not in the confidence
region. Since this is analogous to testing $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ versus $H_1: \boldsymbol{\mu} \neq \boldsymbol{\mu}_0$ [see
(5-7)], we see that the confidence region of (5-18) consists of all $\boldsymbol{\mu}_0$ vectors for which
the $T^2$-test would not reject $H_0$ in favor of $H_1$ at significance level $\alpha$.

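For $p \ge 4$, we cannot graph the joint confidence region for $\boldsymbol{\mu}$. However, we can
calculate the axes of the confidence ellipsoid and their relative lengths. These are
determined from the eigenvalues $\lambda_i$ and eigenvectors $\mathbf{e}_i$ of $\mathbf{S}$. As in (4-7), the
directions and lengths of the axes of

$$n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \le c^2 = \frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)$$

are determined by going

$$\sqrt{\lambda_i}\,c/\sqrt{n} = \sqrt{\lambda_i}\sqrt{p(n-1)F_{p,n-p}(\alpha)/n(n-p)}$$

units along the eigenvectors $\mathbf{e}_i$. Beginning at the center $\bar{\mathbf{x}}$, the axes of the confidence
ellipsoid are

$$\pm\sqrt{\lambda_i}\sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)}\ \mathbf{e}_i \qquad \text{where } \mathbf{S}\mathbf{e}_i = \lambda_i\mathbf{e}_i, \quad i = 1, 2, \ldots, p \tag{5-19}$$

The ratios of the $\lambda_i$'s will help identify relative amounts of elongation along pairs
of axes.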


Example 5.3 (Constructing a confidence ellipse for $\boldsymbol{\mu}$) Data for radiation from
microwave ovens were introduced in Examples 4.10 and 4.17. Let

$$x_1 = \sqrt[4]{\text{measured radiation with door closed}}$$

and

$$x_2 = \sqrt[4]{\text{measured radiation with door open}}$$




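For the $n = 42$ pairs of transformed observations, we find that

$$\bar{\mathbf{x}} = \begin{bmatrix} .564 \\ .603 \end{bmatrix}, \qquad \mathbf{S} = \begin{bmatrix} .0144 & .0117 \\ .0117 & .0146 \end{bmatrix}, \qquad \mathbf{S}^{-1} = \begin{bmatrix} 203.018 & -163.391 \\ -163.391 & 200.228 \end{bmatrix}$$

The eigenvalue and eigenvector pairs for $\mathbf{S}$ are

$$\lambda_1 = .026, \qquad \mathbf{e}_1' = [.704,\ .710]$$
$$\lambda_2 = .002, \qquad \mathbf{e}_2' = [-.710,\ .704]$$

The 95% confidence ellipse for $\boldsymbol{\mu}$ consists of all values $(\mu_1, \mu_2)$ satisfying

$$42\,[.564 - \mu_1,\ .603 - \mu_2]\begin{bmatrix} 203.018 & -163.391 \\ -163.391 & 200.228 \end{bmatrix}\begin{bmatrix} .564 - \mu_1 \\ .603 - \mu_2 \end{bmatrix} \le \frac{2(41)}{40}F_{2,40}(.05)$$

or, since $F_{2,40}(.05) = 3.23$,

$$42(203.018)(.564 - \mu_1)^2 + 42(200.228)(.603 - \mu_2)^2 - 84(163.391)(.564 - \mu_1)(.603 - \mu_2) \le 6.62$$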


To see whether $\boldsymbol{\mu}' = [.562, .589]$ is in the confidence region, we compute

$$42(203.018)(.564 - .562)^2 + 42(200.228)(.603 - .589)^2 - 84(163.391)(.564 - .562)(.603 - .589) = 1.30 \le 6.62$$

We conclude that $\boldsymbol{\mu}' = [.562, .589]$ is in the region. Equivalently, a test of
$H_0: \boldsymbol{\mu} = \begin{bmatrix} .562 \\ .589 \end{bmatrix}$ would not be rejected in favor of $H_1: \boldsymbol{\mu} \neq \begin{bmatrix} .562 \\ .589 \end{bmatrix}$ at the $\alpha = .05$ level
of significance.

The joint confidence ellipsoid is plotted in Figure 5.1. The center is at
$\bar{\mathbf{x}}' = [.564, .603]$, and the half-lengths of the major and minor axes are given by

$$\sqrt{\lambda_1}\sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)} = \sqrt{.026}\sqrt{\frac{2(41)}{42(40)}(3.23)} = .064$$

and

$$\sqrt{\lambda_2}\sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)} = \sqrt{.002}\sqrt{\frac{2(41)}{42(40)}(3.23)} = .018$$

respectively. The axes lie along $\mathbf{e}_1' = [.704, .710]$ and $\mathbf{e}_2' = [-.710, .704]$ when these
vectors are plotted with $\bar{\mathbf{x}}$ as the origin. An indication of the elongation of the
confidence ellipse is provided by the ratio of the lengths of the major and minor axes.
This ratio is

$$\frac{2\sqrt{\lambda_1}\sqrt{\dfrac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)}}{2\sqrt{\lambda_2}\sqrt{\dfrac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)}} = \frac{\sqrt{\lambda_1}}{\sqrt{\lambda_2}} = \frac{.161}{.045} = 3.6$$

The length of the major axis is 3.6 times the length of the minor axis. ■

Figure 5.1 A 95% confidence ellipse for $\boldsymbol{\mu}$ based on microwave-radiation data.
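The calculations of Example 5.3 can be retraced with a few lines of Python. The sketch below uses only the standard library and works from the rounded summary statistics printed in the example ($\mathbf{S}^{-1}$, the eigenvalues, and $F_{2,40}(.05) = 3.23$), so it reproduces the text's figures rather than recomputing them from raw data; the function name `quad_form` is ours.

```python
import math

n, p = 42, 2
xbar = (0.564, 0.603)
S_inv = ((203.018, -163.391),
         (-163.391, 200.228))           # S^{-1} as printed in Example 5.3
F_crit = 3.23                            # F_{2,40}(.05), quoted in the text

c2 = p * (n - 1) / (n - p) * F_crit      # right-hand side of (5-18)

def quad_form(mu):
    """n (xbar - mu)' S^{-1} (xbar - mu) for a candidate mean vector mu."""
    d1, d2 = xbar[0] - mu[0], xbar[1] - mu[1]
    return n * (S_inv[0][0] * d1 * d1
                + 2.0 * S_inv[0][1] * d1 * d2
                + S_inv[1][1] * d2 * d2)

q = quad_form((0.562, 0.589))
inside = q <= c2                         # True when mu lies in the 95% ellipse

# Half-lengths of the axes from (5-19), using the printed eigenvalues of S.
eigenvalues = (0.026, 0.002)
half_lengths = [math.sqrt(lam) * math.sqrt(c2 / n) for lam in eigenvalues]

print(round(c2, 2), round(q, 2), inside)
print([round(h, 3) for h in half_lengths])
```

A value of `q` no larger than `c2` corresponds to not rejecting $H_0$ at the 5% level, which is the equivalence between the confidence region and the $T^2$-test noted above.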


Simultaneous Confidence Statements 


While the confidence region $n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \le c^2$, for $c$ a constant, correctly
assesses the joint knowledge concerning plausible values of $\boldsymbol{\mu}$, any summary of
conclusions ordinarily includes confidence statements about the individual component
means. In so doing, we adopt the attitude that all of the separate confidence
statements should hold simultaneously with a specified high probability. It is the
guarantee of a specified probability against any statement being incorrect that motivates
the term simultaneous confidence intervals. We begin by considering simultaneous
confidence statements which are intimately related to the joint confidence region
based on the $T^2$-statistic.
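Let $\mathbf{X}$ have an $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution and form the linear combination

$$Z = a_1X_1 + a_2X_2 + \cdots + a_pX_p = \mathbf{a}'\mathbf{X}$$

From (2-43),

$$\mu_Z = E(Z) = \mathbf{a}'\boldsymbol{\mu}$$

and

$$\sigma_Z^2 = \operatorname{Var}(Z) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$$

Moreover, by Result 4.2, $Z$ has an $N(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})$ distribution. If a random sample
$\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ from the $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population is available, a corresponding sample
of $Z$'s can be created by taking linear combinations. Thus,

$$Z_j = a_1X_{j1} + a_2X_{j2} + \cdots + a_pX_{jp} = \mathbf{a}'\mathbf{X}_j \qquad j = 1, 2, \ldots, n$$

The sample mean and variance of the observed values $z_1, z_2, \ldots, z_n$ are, by (3-36),

$$\bar{z} = \mathbf{a}'\bar{\mathbf{x}}$$

and

$$s_z^2 = \mathbf{a}'\mathbf{S}\mathbf{a}$$

where $\bar{\mathbf{x}}$ and $\mathbf{S}$ are the sample mean vector and covariance matrix of the $\mathbf{x}_j$'s,
respectively.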
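Simultaneous confidence intervals can be developed from a consideration of
confidence intervals for $\mathbf{a}'\boldsymbol{\mu}$ for various choices of $\mathbf{a}$. The argument proceeds as follows.
For $\mathbf{a}$ fixed and $\sigma_Z^2$ unknown, a $100(1 - \alpha)\%$ confidence interval for $\mu_Z = \mathbf{a}'\boldsymbol{\mu}$
is based on student's t-ratio

$$t = \frac{\bar{z} - \mu_Z}{s_z/\sqrt{n}} = \frac{\sqrt{n}\,(\mathbf{a}'\bar{\mathbf{x}} - \mathbf{a}'\boldsymbol{\mu})}{\sqrt{\mathbf{a}'\mathbf{S}\mathbf{a}}} \tag{5-20}$$

and leads to the statement

$$\bar{z} - t_{n-1}(\alpha/2)\frac{s_z}{\sqrt{n}} \le \mu_Z \le \bar{z} + t_{n-1}(\alpha/2)\frac{s_z}{\sqrt{n}}$$

or

$$\mathbf{a}'\bar{\mathbf{x}} - t_{n-1}(\alpha/2)\frac{\sqrt{\mathbf{a}'\mathbf{S}\mathbf{a}}}{\sqrt{n}} \le \mathbf{a}'\boldsymbol{\mu} \le \mathbf{a}'\bar{\mathbf{x}} + t_{n-1}(\alpha/2)\frac{\sqrt{\mathbf{a}'\mathbf{S}\mathbf{a}}}{\sqrt{n}} \tag{5-21}$$

where $t_{n-1}(\alpha/2)$ is the upper $100(\alpha/2)$th percentile of a t-distribution with $n - 1$ d.f.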

Inequality (5-21) can be interpreted as a statement about the components of the
mean vector $\boldsymbol{\mu}$. For example, with $\mathbf{a}' = [1, 0, \ldots, 0]$, $\mathbf{a}'\boldsymbol{\mu} = \mu_1$, and (5-21) becomes
the usual confidence interval for a normal population mean. (Note, in this case, that
$\mathbf{a}'\mathbf{S}\mathbf{a} = s_{11}$.) Clearly, we could make several confidence statements about the
components of $\boldsymbol{\mu}$, each with associated confidence coefficient $1 - \alpha$, by choosing
different coefficient vectors $\mathbf{a}$. However, the confidence associated with all of the
statements taken together is not $1 - \alpha$.

Intuitively, it would be desirable to associate a "collective" confidence coefficient
of $1 - \alpha$ with the confidence intervals that can be generated by all choices of
$\mathbf{a}$. However, a price must be paid for the convenience of a large simultaneous
confidence coefficient: intervals that are wider (less precise) than the interval of (5-21)
for a specific choice of $\mathbf{a}$.

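Given a data set $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ and a particular $\mathbf{a}$, the confidence interval in
(5-21) is that set of $\mathbf{a}'\boldsymbol{\mu}$ values for which

$$|t| = \left|\frac{\sqrt{n}\,(\mathbf{a}'\bar{\mathbf{x}} - \mathbf{a}'\boldsymbol{\mu})}{\sqrt{\mathbf{a}'\mathbf{S}\mathbf{a}}}\right| \le t_{n-1}(\alpha/2)$$

or, equivalently,

$$t^2 = \frac{n(\mathbf{a}'\bar{\mathbf{x}} - \mathbf{a}'\boldsymbol{\mu})^2}{\mathbf{a}'\mathbf{S}\mathbf{a}} = \frac{n(\mathbf{a}'(\bar{\mathbf{x}} - \boldsymbol{\mu}))^2}{\mathbf{a}'\mathbf{S}\mathbf{a}} \le t^2_{n-1}(\alpha/2) \tag{5-22}$$

A simultaneous confidence region is given by the set of $\mathbf{a}'\boldsymbol{\mu}$ values such that $t^2$ is
relatively small for all choices of $\mathbf{a}$. It seems reasonable to expect that the constant
$t^2_{n-1}(\alpha/2)$ in (5-22) will be replaced by a larger value, $c^2$, when statements are
developed for many choices of $\mathbf{a}$.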




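Considering the values of $\mathbf{a}$ for which $t^2 \le c^2$, we are naturally led to the
determination of

$$\max_{\mathbf{a}} t^2 = \max_{\mathbf{a}}\frac{n(\mathbf{a}'(\bar{\mathbf{x}} - \boldsymbol{\mu}))^2}{\mathbf{a}'\mathbf{S}\mathbf{a}}$$

Using the maximization lemma (2-50) with $\mathbf{x} = \mathbf{a}$, $\mathbf{d} = (\bar{\mathbf{x}} - \boldsymbol{\mu})$, and $\mathbf{B} = \mathbf{S}$, we get

$$\max_{\mathbf{a}}\frac{n(\mathbf{a}'(\bar{\mathbf{x}} - \boldsymbol{\mu}))^2}{\mathbf{a}'\mathbf{S}\mathbf{a}} = n\left[\max_{\mathbf{a}}\frac{(\mathbf{a}'(\bar{\mathbf{x}} - \boldsymbol{\mu}))^2}{\mathbf{a}'\mathbf{S}\mathbf{a}}\right] = n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) = T^2 \tag{5-23}$$

with the maximum occurring for $\mathbf{a}$ proportional to $\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu})$.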
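The maximization in (5-23) can be illustrated numerically. The sketch below assumes NumPy and, purely for illustration, uses the data of Example 5.1 with the trial value $\boldsymbol{\mu}' = [9, 5]$. It evaluates $t^2$ at $\mathbf{a} \propto \mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu})$ and at many randomly drawn vectors $\mathbf{a}$; the random search is our device and plays no role in the proof.

```python
import numpy as np

X = np.array([[6.0, 9.0], [10.0, 6.0], [8.0, 3.0]])   # data of Example 5.1
mu = np.array([9.0, 5.0])                             # trial value of the mean vector

n = X.shape[0]
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
d = xbar - mu

T2 = n * d @ np.linalg.solve(S, d)                    # right-hand side of (5-23)

def t2_for(a):
    """Squared t-ratio n (a'(xbar - mu))^2 / (a' S a) for the combination a."""
    return n * (a @ d) ** 2 / (a @ S @ a)

a_star = np.linalg.solve(S, d)                        # a proportional to S^{-1}(xbar - mu)

rng = np.random.default_rng(0)
max_trial = max(t2_for(rng.normal(size=2)) for _ in range(1000))

print(T2, t2_for(a_star), max_trial)                  # max_trial never exceeds T2
```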


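Result 5.3. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from an $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population
with $\boldsymbol{\Sigma}$ positive definite. Then, simultaneously for all $\mathbf{a}$, the interval

$$\left(\mathbf{a}'\bar{\mathbf{X}} - \sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)\,\mathbf{a}'\mathbf{S}\mathbf{a}},\quad \mathbf{a}'\bar{\mathbf{X}} + \sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)\,\mathbf{a}'\mathbf{S}\mathbf{a}}\right)$$

will contain $\mathbf{a}'\boldsymbol{\mu}$ with probability $1 - \alpha$.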


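Proof. From (5-23),

$$T^2 = n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \le c^2 \quad \text{implies} \quad \frac{n(\mathbf{a}'\bar{\mathbf{x}} - \mathbf{a}'\boldsymbol{\mu})^2}{\mathbf{a}'\mathbf{S}\mathbf{a}} \le c^2$$

for every $\mathbf{a}$, or

$$\mathbf{a}'\bar{\mathbf{x}} - c\sqrt{\frac{\mathbf{a}'\mathbf{S}\mathbf{a}}{n}} \le \mathbf{a}'\boldsymbol{\mu} \le \mathbf{a}'\bar{\mathbf{x}} + c\sqrt{\frac{\mathbf{a}'\mathbf{S}\mathbf{a}}{n}}$$

for every $\mathbf{a}$. Choosing $c^2 = p(n-1)F_{p,n-p}(\alpha)/(n-p)$ [see (5-6)] gives intervals
that will contain $\mathbf{a}'\boldsymbol{\mu}$ for all $\mathbf{a}$, with probability $1 - \alpha = P[T^2 \le c^2]$. ■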


It is convenient to refer to the simultaneous intervals of Result 5.3 as
$T^2$-intervals, since the coverage probability is determined by the distribution of $T^2$.


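The successive choices $\mathbf{a}' = [1, 0, \ldots, 0]$, $\mathbf{a}' = [0, 1, \ldots, 0]$, and so on through
$\mathbf{a}' = [0, 0, \ldots, 1]$ for the $T^2$-intervals allow us to conclude that

$$\begin{aligned}
\bar{x}_1 - \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{11}}{n}} \le \mu_1 &\le \bar{x}_1 + \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{11}}{n}} \\
\bar{x}_2 - \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{22}}{n}} \le \mu_2 &\le \bar{x}_2 + \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{22}}{n}} \\
&\ \,\vdots \\
\bar{x}_p - \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{pp}}{n}} \le \mu_p &\le \bar{x}_p + \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{pp}}{n}}
\end{aligned} \tag{5-24}$$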

all hold simultaneously with confidence coefficient $1 - \alpha$. Note that, without
modifying the coefficient $1 - \alpha$, we can make statements about the differences $\mu_i - \mu_k$
corresponding to $\mathbf{a}' = [0, \ldots, 0, a_i, 0, \ldots, 0, a_k, 0, \ldots, 0]$, where $a_i = 1$ and
$a_k = -1$. In this case $\mathbf{a}'\mathbf{S}\mathbf{a} = s_{ii} - 2s_{ik} + s_{kk}$, and we have the statement

$$\begin{aligned}
\bar{x}_i - \bar{x}_k - \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{ii} - 2s_{ik} + s_{kk}}{n}} &\le \mu_i - \mu_k \\
&\le \bar{x}_i - \bar{x}_k + \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{ii} - 2s_{ik} + s_{kk}}{n}}
\end{aligned} \tag{5-25}$$


The simultaneous $T^2$ confidence intervals are ideal for "data snooping." The
confidence coefficient $1 - \alpha$ remains unchanged for any choice of $\mathbf{a}$, so linear
combinations of the components $\mu_i$ that merit inspection based upon an examination of
the data can be estimated.

In addition, according to the results in Supplement 5A, we can include the
statements about $(\mu_i, \mu_k)$ belonging to the sample mean-centered ellipses

$$n\,[\bar{x}_i - \mu_i,\ \bar{x}_k - \mu_k]\begin{bmatrix} s_{ii} & s_{ik} \\ s_{ik} & s_{kk} \end{bmatrix}^{-1}\begin{bmatrix} \bar{x}_i - \mu_i \\ \bar{x}_k - \mu_k \end{bmatrix} \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha) \tag{5-26}$$

and still maintain the confidence coefficient $(1 - \alpha)$ for the whole set of statements.
The simultaneous $T^2$ confidence intervals for the individual components of a
mean vector are just the shadows, or projections, of the confidence ellipsoid on the
component axes. This connection between the shadows of the ellipsoid and the
simultaneous confidence intervals given by (5-24) is illustrated in the next example.


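Example 5.4 (Simultaneous confidence intervals as shadows of the confidence ellipsoid)
In Example 5.3, we obtained the 95% confidence ellipse for the means of the fourth
roots of the door-closed and door-open microwave radiation measurements. The 95%
simultaneous $T^2$ intervals for the two component means are, from (5-24),

$$\left(\bar{x}_1 - \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(.05)}\sqrt{\frac{s_{11}}{n}},\quad \bar{x}_1 + \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(.05)}\sqrt{\frac{s_{11}}{n}}\right)$$

$$= \left(.564 - \sqrt{\frac{2(41)}{40}3.23}\sqrt{\frac{.0144}{42}},\quad .564 + \sqrt{\frac{2(41)}{40}3.23}\sqrt{\frac{.0144}{42}}\right) \quad \text{or} \quad (.516,\ .612)$$

$$\left(\bar{x}_2 - \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(.05)}\sqrt{\frac{s_{22}}{n}},\quad \bar{x}_2 + \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(.05)}\sqrt{\frac{s_{22}}{n}}\right)$$

$$= \left(.603 - \sqrt{\frac{2(41)}{40}3.23}\sqrt{\frac{.0146}{42}},\quad .603 + \sqrt{\frac{2(41)}{40}3.23}\sqrt{\frac{.0146}{42}}\right) \quad \text{or} \quad (.555,\ .651)$$

In Figure 5.2, we have redrawn the 95% confidence ellipse from Example 5.3.
The 95% simultaneous intervals are shown as shadows, or projections, of this ellipse
on the axes of the component means. ■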


Example 5.5 (Constructing simultaneous confidence intervals and ellipses) The
scores obtained by $n = 87$ college students on the College Level Examination Program
(CLEP) subtest $X_1$ and the College Qualification Test (CQT) subtests $X_2$ and
$X_3$ are given in Table 5.2 on page 228 for $X_1$ = social science and history,
$X_2$ = verbal, and $X_3$ = science. These data give


Figure 5.2 Simultaneous $T^2$-intervals for the component means as shadows of the
confidence ellipse on the axes: microwave radiation data.


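$$\bar{\mathbf{x}} = \begin{bmatrix} 526.29 \\ 54.69 \\ 25.13 \end{bmatrix} \qquad \text{and} \qquad \mathbf{S} = \begin{bmatrix} 5808.06 & 597.84 & 222.03 \\ 597.84 & 126.05 & 23.39 \\ 222.03 & 23.39 & 23.11 \end{bmatrix}$$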


Let us compute the 95% simultaneous confidence intervals for $\mu_1$, $\mu_2$, and $\mu_3$.
We have

$$\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha) = \frac{3(87 - 1)}{(87 - 3)}F_{3,84}(.05) = \frac{3(86)}{84}(2.7) = 8.29$$

and we obtain the simultaneous confidence statements [see (5-24)]

$$526.29 - \sqrt{8.29}\sqrt{\frac{5808.06}{87}} \le \mu_1 \le 526.29 + \sqrt{8.29}\sqrt{\frac{5808.06}{87}}$$

or

$$503.06 \le \mu_1 \le 550.12$$

$$54.69 - \sqrt{8.29}\sqrt{\frac{126.05}{87}} \le \mu_2 \le 54.69 + \sqrt{8.29}\sqrt{\frac{126.05}{87}}$$

or

$$51.22 \le \mu_2 \le 58.16$$

$$25.13 - \sqrt{8.29}\sqrt{\frac{23.11}{87}} \le \mu_3 \le 25.13 + \sqrt{8.29}\sqrt{\frac{23.11}{87}}$$

or

$$23.65 \le \mu_3 \le 26.61$$

Table 5.2 College Test Data

                    X1                  X2          X3
Individual  (Social science and     (Verbal)    (Science)
                 history)

Source: Data courtesy of Richard W. Johnson.


With the possible exception of the verbal scores, the marginal Q-Q plots and two- 
dimensional scatter plots do not reveal any serious departures from normality for 
the college qualification test data. (See Exercise 5.18.) Moreover, the sample size is 
large enough to justify the methodology, even though the data are not quite normally 
distributed. (See Section 5.5.)

The simultaneous T?-intervals above are wider than univariate intervals because 
all three must hold with 95% confidence. They may also be wider than necessary, be- 
cause, with the same confidence, we can make statements about differences. 

For instance, with $\mathbf{a}' = [0, 1, -1]$, the interval for $\mu_2 - \mu_3$ has endpoints

$$(\bar{x}_2 - \bar{x}_3) \pm \sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(.05)}\sqrt{\frac{s_{22} + s_{33} - 2s_{23}}{n}}$$

$$= (54.69 - 25.13) \pm \sqrt{8.29}\sqrt{\frac{126.05 + 23.11 - 2(23.39)}{87}} = 29.56 \pm 3.12$$

so (26.44, 32.68) is a 95% confidence interval for $\mu_2 - \mu_3$. Simultaneous intervals
can also be constructed for the other differences.

Finally, we can construct confidence ellipses for pairs of means, and the same
95% confidence holds. For example, for the pair $(\mu_2, \mu_3)$, we have

$$87\,[54.69 - \mu_2,\ 25.13 - \mu_3]\begin{bmatrix} 126.05 & 23.39 \\ 23.39 & 23.11 \end{bmatrix}^{-1}\begin{bmatrix} 54.69 - \mu_2 \\ 25.13 - \mu_3 \end{bmatrix}$$

$$= 0.849(54.69 - \mu_2)^2 + 4.633(25.13 - \mu_3)^2 - 2 \times 0.859(54.69 - \mu_2)(25.13 - \mu_3) \le 8.29$$

This ellipse is shown in Figure 5.3 on page 230, along with the 95% confidence ellipses for
the other two pairs of means. The projections or shadows of these ellipses on the axes are
also indicated, and these projections are the $T^2$-intervals. ■
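The $T^2$-intervals of Result 5.3 lend themselves to a small helper function. The sketch below assumes NumPy and takes $\bar{\mathbf{x}}$, $\mathbf{S}$, and the constant $c^2 = 8.29$ as printed in Example 5.5; the function name `t2_interval` is ours. It is applied to the verbal and science means and to their difference.

```python
import numpy as np

n, p = 87, 3
xbar = np.array([526.29, 54.69, 25.13])          # as printed in Example 5.5
S = np.array([[5808.06, 597.84, 222.03],
              [597.84, 126.05, 23.39],
              [222.03, 23.39, 23.11]])
c2 = 8.29   # p(n-1)/(n-p) * F_{3,84}(.05), as computed in the text

def t2_interval(a):
    """Simultaneous T^2-interval for a'mu: a'xbar -/+ sqrt(c^2 * a'Sa / n)."""
    a = np.asarray(a, dtype=float)
    center = a @ xbar
    half_width = np.sqrt(c2 * (a @ S @ a) / n)
    return center - half_width, center + half_width

verbal = t2_interval([0, 1, 0])        # interval for mu_2
science = t2_interval([0, 0, 1])       # interval for mu_3
difference = t2_interval([0, 1, -1])   # interval for mu_2 - mu_3

for name, (lo, hi) in [("mu2", verbal), ("mu3", science), ("mu2 - mu3", difference)]:
    print(name, round(lo, 2), round(hi, 2))
```

Any other coefficient vector $\mathbf{a}$ can be passed to the same function without changing the overall 95% confidence, which is the sense in which these intervals support "data snooping."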


A Comparison of Simultaneous Confidence Intervals 

with One-at-a-Time Intervals 

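An alternative approach to the construction of confidence intervals is to consider
the components $\mu_i$ one at a time, as suggested by (5-21) with $\mathbf{a}' = [0, \ldots, 0,
a_i, 0, \ldots, 0]$ where $a_i = 1$. This approach ignores the covariance structure of the
$p$ variables and leads to the intervals

$$\begin{aligned}
\bar{x}_1 - t_{n-1}(\alpha/2)\sqrt{\frac{s_{11}}{n}} \le \mu_1 &\le \bar{x}_1 + t_{n-1}(\alpha/2)\sqrt{\frac{s_{11}}{n}} \\
\bar{x}_2 - t_{n-1}(\alpha/2)\sqrt{\frac{s_{22}}{n}} \le \mu_2 &\le \bar{x}_2 + t_{n-1}(\alpha/2)\sqrt{\frac{s_{22}}{n}} \\
&\ \,\vdots \\
\bar{x}_p - t_{n-1}(\alpha/2)\sqrt{\frac{s_{pp}}{n}} \le \mu_p &\le \bar{x}_p + t_{n-1}(\alpha/2)\sqrt{\frac{s_{pp}}{n}}
\end{aligned} \tag{5-27}$$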


Figure 5.3 95% confidence ellipses for pairs of means and the simultaneous
$T^2$-intervals: college test data.


Although prior to sampling, the $i$th interval has probability $1 - \alpha$ of covering $\mu_i$,
we do not know what to assert, in general, about the probability of all intervals
containing their respective $\mu_i$'s. As we have pointed out, this probability is not $1 - \alpha$.


To shed some light on the problem, consider the special case where the
observations have a joint normal distribution and

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{bmatrix}$$

Since the observations on the first variable are independent of those on the second
variable, and so on, the product rule for independent events can be applied. Before
the sample is selected,

$$P[\text{all } t\text{-intervals in (5-27) contain the } \mu_i\text{'s}] = (1 - \alpha)(1 - \alpha)\cdots(1 - \alpha) = (1 - \alpha)^p$$

If $1 - \alpha = .95$ and $p = 6$, this probability is $(.95)^6 = .74$.
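The rate at which the joint coverage $(1 - \alpha)^p$ deteriorates with $p$ is easy to tabulate. The following sketch uses only plain Python; the function name `joint_coverage` and the particular values of $p$ shown are ours, and the calculation applies only to the independent (diagonal $\boldsymbol{\Sigma}$) case discussed above.

```python
def joint_coverage(alpha, p):
    """Probability that p independent 100(1 - alpha)% intervals all cover."""
    return (1.0 - alpha) ** p

coverage_p6 = joint_coverage(0.05, 6)      # the case discussed in the text
print(round(coverage_p6, 2))               # about .74

# Joint coverage for a few other numbers of component means.
for p in (1, 2, 4, 6, 10):
    print(p, round(joint_coverage(0.05, p), 3))
```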




To guarantee a probability of $1 - \alpha$ that all of the statements about the component
means hold simultaneously, the individual intervals must be wider than the separate
t-intervals; just how much wider depends on both $p$ and $n$, as well as on $1 - \alpha$.

For $1 - \alpha = .95$, $n = 15$, and $p = 4$, the multipliers of $\sqrt{s_{ii}/n}$ in (5-24) and
(5-27) are

$$\sqrt{\frac{p(n-1)}{(n-p)}F_{p,n-p}(.05)} = \sqrt{\frac{4(14)}{11}(3.36)} = 4.14$$

and $t_{n-1}(.025) = 2.145$, respectively. Consequently, in this case the simultaneous
intervals are $100(4.14 - 2.145)/2.145 = 93\%$ wider than those derived from the
one-at-a-time t method.
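The comparison of the two multipliers can be written out as a short calculation. The sketch below uses only the standard library and takes the percentiles $F_{4,11}(.05) = 3.36$ and $t_{14}(.025) = 2.145$ from the text rather than computing them; the variable names are ours.

```python
import math

n, p = 15, 4
F_crit = 3.36     # F_{4,11}(.05), quoted in the text
t_crit = 2.145    # t_{14}(.025), quoted in the text

# Multiplier of sqrt(s_ii / n) for the simultaneous T^2-intervals in (5-24).
t2_multiplier = math.sqrt(p * (n - 1) / (n - p) * F_crit)

# Percentage by which the T^2-intervals are wider than the t-intervals in (5-27).
pct_wider = 100.0 * (t2_multiplier - t_crit) / t_crit

print(round(t2_multiplier, 2), round(pct_wider))
```

Substituting other values of $n$, $p$, and the corresponding tabled percentiles gives the same kind of comparison for other sample sizes and dimensions.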

Table 5.3 gives some critical distance multipliers for one-at-a-time t-intervals
computed according to (5-21), as well as the corresponding simultaneous $T^2$-intervals.
In general, the width of the $T^2$-intervals, relative to the t-intervals, increases for
fixed $n$ as $p$ increases and decreases for fixed $p$ as $n$ increases.


Table 5.3 Critical Distance Multipliers for One-at-a-Time t-Intervals and
$T^2$-Intervals for Selected $n$ and $p$ ($1 - \alpha = .95$)


$t_{n-1}(.025)$


The comparison implied by Table 5.3 is a bit unfair, since the confidence level associated with any collection of $T^2$-intervals, for fixed n and p, is .95, and the overall confidence associated with a collection of individual t intervals, for the same n, can, as we have seen, be much less than .95. The one-at-a-time t intervals are too short to maintain an overall confidence level for separate statements about, say, all p means. Nevertheless, we sometimes look at them as the best possible information concerning a mean, if this is the only inference to be made. Moreover, if the one-at-a-time intervals are calculated only when the $T^2$-test rejects the null hypothesis, some researchers think they may more accurately represent the information about the means than the $T^2$-intervals do.

The $T^2$-intervals are too wide if they are applied only to the p component means. To see why, consider the confidence ellipse and the simultaneous intervals shown in Figure 5.2. If $\mu_1$ lies in its $T^2$-interval and $\mu_2$ lies in its $T^2$-interval, then $(\mu_1, \mu_2)$ lies in the rectangle formed by these two intervals. This rectangle contains the confidence ellipse and more. The confidence ellipse is smaller but has probability .95 of covering the mean vector $\boldsymbol{\mu}$ with its component means $\mu_1$ and $\mu_2$. Consequently, the probability of covering the two individual means $\mu_1$ and $\mu_2$ will be larger than .95 for the rectangle formed by the $T^2$-intervals. This result leads us to consider a second approach to making multiple comparisons known as the Bonferroni method.




The Bonferroni Method of Multiple Comparisons


Often, attention is restricted to a small number of individual confidence statements. In these situations it is possible to do better than the simultaneous intervals of Result 5.3. If the number m of specified component means $\mu_i$ or linear combinations $\mathbf{a}'\boldsymbol{\mu} = a_1\mu_1 + a_2\mu_2 + \cdots + a_p\mu_p$ is small, simultaneous confidence intervals can be developed that are shorter (more precise) than the simultaneous $T^2$-intervals.

The alternative method for multiple comparisons is called the Bonferroni method because it is developed from a probability inequality carrying that name.

Suppose that, prior to the collection of data, confidence statements about m linear combinations $\mathbf{a}_1'\boldsymbol{\mu}, \mathbf{a}_2'\boldsymbol{\mu}, \ldots, \mathbf{a}_m'\boldsymbol{\mu}$ are required. Let $C_i$ denote a confidence statement about the value of $\mathbf{a}_i'\boldsymbol{\mu}$ with $P[C_i \text{ true}] = 1 - \alpha_i$, $i = 1, 2, \ldots, m$. Now (see Exercise 5.6),

$$\begin{aligned}
P[\text{all } C_i \text{ true}] &= 1 - P[\text{at least one } C_i \text{ false}] \\
&\ge 1 - \sum_{i=1}^{m} P(C_i \text{ false}) = 1 - \sum_{i=1}^{m} \bigl(1 - P(C_i \text{ true})\bigr) \\
&= 1 - (\alpha_1 + \alpha_2 + \cdots + \alpha_m)
\end{aligned} \tag{5-28}$$
Inequality (5-28), a special case of the Bonferroni inequality, allows an investigator to control the overall error rate $\alpha_1 + \alpha_2 + \cdots + \alpha_m$, regardless of the correlation structure behind the confidence statements. There is also the flexibility of controlling the error rate for a group of important statements and balancing it by another choice for the less important statements.
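A small sketch of how (5-28) is used to budget the error rate. The particular allocations below are hypothetical, chosen only to illustrate equal and unequal splits of an overall $\alpha = .05$.

```python
def bonferroni_lower_bound(alphas):
    """Lower bound from (5-28) on P[all confidence statements true]."""
    return 1.0 - sum(alphas)


# Equal allocation: four statements, each with alpha_i = .05/4.
equal = [0.05 / 4] * 4
# Unequal allocation: more of the error budget on the first statement.
unequal = [0.02, 0.01, 0.01, 0.01]

bound_equal = bonferroni_lower_bound(equal)
bound_unequal = bonferroni_lower_bound(unequal)

# If the four statements happened to be independent, the exact joint
# coverage would be (1 - .0125)^4, which cannot fall below the bound.
independent_coverage = (1 - 0.05 / 4) ** 4
print(bound_equal, bound_unequal, independent_coverage)
```

The bound holds whatever the dependence among the statements; the independent case is shown only as a reference point.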
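Let us develop simultaneous interval estimates for the restricted set consisting of the components $\mu_i$ of $\boldsymbol{\mu}$. Lacking information on the relative importance of these components, we consider the individual t-intervals

$$\bar{x}_i \pm t_{n-1}\!\left(\frac{\alpha_i}{2}\right)\sqrt{\frac{s_{ii}}{n}} \qquad i = 1, 2, \ldots, m$$

with $\alpha_i = \alpha/m$. Since $P[\bar{X}_i \pm t_{n-1}(\alpha/2m)\sqrt{s_{ii}/n} \text{ contains } \mu_i] = 1 - \alpha/m$, $i = 1, 2, \ldots, m$, we have, from (5-28),

$$P\!\left[\bar{X}_i \pm t_{n-1}\!\left(\frac{\alpha}{2m}\right)\sqrt{\frac{s_{ii}}{n}} \text{ contains } \mu_i, \text{ all } i\right] \ge 1 - \underbrace{\left(\frac{\alpha}{m} + \frac{\alpha}{m} + \cdots + \frac{\alpha}{m}\right)}_{m \text{ terms}} = 1 - \alpha$$

Therefore, with an overall confidence level greater than or equal to $1 - \alpha$, we can make the following $m = p$ statements:

$$\begin{aligned}
\bar{x}_1 - t_{n-1}\!\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{11}}{n}} &\le \mu_1 \le \bar{x}_1 + t_{n-1}\!\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{11}}{n}} \\
\bar{x}_2 - t_{n-1}\!\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{22}}{n}} &\le \mu_2 \le \bar{x}_2 + t_{n-1}\!\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{22}}{n}} \\
&\;\;\vdots \\
\bar{x}_p - t_{n-1}\!\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{pp}}{n}} &\le \mu_p \le \bar{x}_p + t_{n-1}\!\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{pp}}{n}}
\end{aligned} \tag{5-29}$$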




The statements in (5-29) can be compared with those in (5-24). The percentage point $t_{n-1}(\alpha/2p)$ replaces $\sqrt{(n-1)pF_{p,n-p}(\alpha)/(n-p)}$, but otherwise the intervals are of the same structure.


Example 5.6 (Constructing Bonferroni simultaneous confidence intervals and comparing them with $T^2$-intervals) Let us return to the microwave oven radiation data in Examples 5.3 and 5.4. We shall obtain the simultaneous 95% Bonferroni confidence intervals for the means, $\mu_1$ and $\mu_2$, of the fourth roots of the door-closed and door-open measurements with $\alpha_i = .05/2$, $i = 1, 2$. We make use of the results in Example 5.3, noting that $n = 42$ and $t_{41}(.05/2(2)) = t_{41}(.0125) = 2.327$, to get

$$\bar{x}_1 \pm t_{41}(.0125)\sqrt{\frac{s_{11}}{n}} = .564 \pm 2.327\sqrt{\frac{.0144}{42}} \quad\text{or}\quad .521 \le \mu_1 \le .607$$

$$\bar{x}_2 \pm t_{41}(.0125)\sqrt{\frac{s_{22}}{n}} = .603 \pm 2.327\sqrt{\frac{.0146}{42}} \quad\text{or}\quad .560 \le \mu_2 \le .646$$


Figure 5.4 shows the 95% $T^2$ simultaneous confidence intervals for $\mu_1$, $\mu_2$ from Figure 5.2, along with the corresponding 95% Bonferroni intervals. For each component mean, the Bonferroni interval falls within the $T^2$-interval. Consequently, the rectangular (joint) region formed by the two Bonferroni intervals is contained in the rectangular region formed by the two $T^2$-intervals. If we are interested only in the component means, the Bonferroni intervals provide more precise estimates than the $T^2$-intervals. On the other hand, the 95% confidence region for $\boldsymbol{\mu}$ gives the plausible values for the pairs $(\mu_1, \mu_2)$ when the correlation between the measured variables is taken into account.

Figure 5.4 The 95% $T^2$ and 95% Bonferroni simultaneous confidence intervals for the component means (microwave radiation data).
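The Bonferroni intervals of Example 5.6 can be recomputed with a few lines of code. This is a sketch under stated assumptions: the critical value 2.327 is typed in from the example rather than computed, $s_{22} = .0146$ is taken from the sample covariance matrix of Example 5.3, and the helper name `bonferroni_interval` is hypothetical.

```python
import math


def bonferroni_interval(xbar, s_ii, n, t_crit):
    """Return (lower, upper) for xbar +/- t_crit * sqrt(s_ii / n)."""
    half_width = t_crit * math.sqrt(s_ii / n)
    return xbar - half_width, xbar + half_width


n = 42
t_crit = 2.327  # t_41(.0125), as quoted in Example 5.6

lo1, hi1 = bonferroni_interval(0.564, 0.0144, n, t_crit)  # door closed
lo2, hi2 = bonferroni_interval(0.603, 0.0146, n, t_crit)  # door open
print(round(lo1, 3), round(hi1, 3))  # 0.521 0.607
print(round(lo2, 3), round(hi2, 3))  # 0.56 0.646
```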




The Bonferroni intervals for linear combinations $\mathbf{a}'\boldsymbol{\mu}$ and the analogous $T^2$-intervals (recall Result 5.3) have the same general form:

$$\mathbf{a}'\bar{\mathbf{X}} \pm (\text{critical value})\sqrt{\frac{\mathbf{a}'\mathbf{S}\mathbf{a}}{n}}$$

Consequently, in every instance where $\alpha_i = \alpha/m$,

$$\frac{\text{Length of Bonferroni interval}}{\text{Length of } T^2\text{-interval}} = \frac{t_{n-1}(\alpha/2m)}{\sqrt{\dfrac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}} \tag{5-30}$$

which does not depend on the random quantities $\bar{\mathbf{X}}$ and $\mathbf{S}$. As we have pointed out, for a small number m of specified parametric functions $\mathbf{a}'\boldsymbol{\mu}$, the Bonferroni intervals will always be shorter. How much shorter is indicated in Table 5.4 for selected n and p.


Table 5.4 (Length of Bonferroni Interval)/(Length of $T^2$-Interval)
for $1 - \alpha = .95$ and $\alpha_i = .05/m$

              m = p
   n        2      4      10
   15      .88    .69    .29
   25      .90    .75    .48
   50      .91    .78    .58
  100      .91    .80    .62
   ∞       .91    .81    .66


We see from Table 5.4 that the Bonferroni method provides shorter intervals 
when m = p. Because they are easy to apply and provide the relatively short confi- 
dence intervals needed for inference, we will often apply simultaneous t-intervals 
based on the Bonferroni method. 
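The limiting ($n \to \infty$) row of Table 5.4 can be approximated without t or F tables, because $t_{n-1}(\alpha/2m)$ tends to the normal percentile $z(\alpha/2m)$ and $p(n-1)F_{p,n-p}(\alpha)/(n-p)$ tends to $\chi_p^2(\alpha)$. The sketch below assumes the standard chi-square table values 5.99, 9.49, and 18.31 for $p = 2, 4, 10$; they are typed in as constants rather than computed, and the names are illustrative.

```python
import math
from statistics import NormalDist

# Upper 5% points of chi-square with p d.f. (standard table values,
# entered by hand; an assumption of this sketch).
CHI2_05 = {2: 5.99, 4: 9.49, 10: 18.31}


def limiting_length_ratio(p: int, alpha: float = 0.05) -> float:
    """Large-n limit of (5-30) with m = p statements."""
    z = NormalDist().inv_cdf(1 - alpha / (2 * p))
    return z / math.sqrt(CHI2_05[p])


for p in (2, 4, 10):
    print(p, round(limiting_length_ratio(p), 3))
```

The values for $p = 4$ and $p = 10$ round to the printed .81 and .66; for $p = 2$ the computed ratio lies within .01 of the printed entry.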


5.5 Large Sample Inferences about a Population Mean Vector 


When the sample size is large, tests of hypotheses and confidence regions for $\boldsymbol{\mu}$ can be constructed without the assumption of a normal population. As illustrated in Exercises 5.15, 5.16, and 5.17, for large n, we are able to make inferences about the population mean even though the parent distribution is discrete. In fact, serious departures from a normal population can be overcome by large sample sizes. Both tests of hypotheses and simultaneous confidence statements will then possess (approximately) their nominal levels.

The advantages associated with large samples may be partially offset by a loss in sample information caused by using only the summary statistics $\bar{\mathbf{x}}$ and $\mathbf{S}$. On the other hand, since $(\bar{\mathbf{x}}, \mathbf{S})$ is a sufficient summary for normal populations [see (4-21)], the closer the underlying population is to multivariate normal, the more efficiently the sample information will be utilized in making inferences.

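All large-sample inferences about $\boldsymbol{\mu}$ are based on a $\chi^2$-distribution. From (4-28), we know that $(\bar{\mathbf{X}} - \boldsymbol{\mu})'(n^{-1}\mathbf{S})^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}) = n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$ is approximately $\chi^2$ with p d.f., and thus,

$$P[n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}) \le \chi_p^2(\alpha)] \doteq 1 - \alpha \tag{5-31}$$

where $\chi_p^2(\alpha)$ is the upper $(100\alpha)$th percentile of the $\chi_p^2$-distribution.

Equation (5-31) immediately leads to large sample tests of hypotheses and simultaneous confidence regions. These procedures are summarized in Results 5.4 and 5.5.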


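Result 5.4. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from a population with mean $\boldsymbol{\mu}$ and positive definite covariance matrix $\boldsymbol{\Sigma}$. When $n - p$ is large, the hypothesis $H_0\colon \boldsymbol{\mu} = \boldsymbol{\mu}_0$ is rejected in favor of $H_1\colon \boldsymbol{\mu} \ne \boldsymbol{\mu}_0$, at a level of significance approximately $\alpha$, if the observed

$$n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) > \chi_p^2(\alpha)$$

Here $\chi_p^2(\alpha)$ is the upper $(100\alpha)$th percentile of a chi-square distribution with p d.f.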


Comparing the test in Result 5.4 with the corresponding normal theory test in (5-7), we see that the test statistics have the same structure, but the critical values are different. A closer examination, however, reveals that both tests yield essentially the same result in situations where the $\chi^2$-test of Result 5.4 is appropriate. This follows directly from the fact that $(n-1)pF_{p,n-p}(\alpha)/(n-p)$ and $\chi_p^2(\alpha)$ are approximately equal for n large relative to p. (See Tables 3 and 4 in the appendix.)


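Result 5.5. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from a population with mean $\boldsymbol{\mu}$ and positive definite covariance $\boldsymbol{\Sigma}$. If $n - p$ is large,

$$\mathbf{a}'\bar{\mathbf{X}} \pm \sqrt{\chi_p^2(\alpha)}\sqrt{\frac{\mathbf{a}'\mathbf{S}\mathbf{a}}{n}}$$

will contain $\mathbf{a}'\boldsymbol{\mu}$, for every $\mathbf{a}$, with probability approximately $1 - \alpha$. Consequently, we can make the $100(1 - \alpha)\%$ simultaneous confidence statements

$$\begin{aligned}
\bar{x}_1 \pm \sqrt{\chi_p^2(\alpha)}\sqrt{\frac{s_{11}}{n}} &\quad\text{contains } \mu_1 \\
\bar{x}_2 \pm \sqrt{\chi_p^2(\alpha)}\sqrt{\frac{s_{22}}{n}} &\quad\text{contains } \mu_2 \\
&\;\;\vdots \\
\bar{x}_p \pm \sqrt{\chi_p^2(\alpha)}\sqrt{\frac{s_{pp}}{n}} &\quad\text{contains } \mu_p
\end{aligned}$$

and, in addition, for all pairs $(\mu_i, \mu_k)$, $i, k = 1, 2, \ldots, p$, the sample mean-centered ellipses

$$n\,[\bar{x}_i - \mu_i,\; \bar{x}_k - \mu_k]\begin{bmatrix} s_{ii} & s_{ik} \\ s_{ik} & s_{kk} \end{bmatrix}^{-1}\begin{bmatrix} \bar{x}_i - \mu_i \\ \bar{x}_k - \mu_k \end{bmatrix} \le \chi_p^2(\alpha) \quad\text{contain } (\mu_i, \mu_k)$$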




Proof. The first part follows from Result 5A.1, with $c^2 = \chi_p^2(\alpha)$. The probability level is a consequence of (5-31). The statements for the $\mu_i$ are obtained by the special choices $\mathbf{a}' = [0, \ldots, 0, a_i, 0, \ldots, 0]$, where $a_i = 1$, $i = 1, 2, \ldots, p$. The ellipsoids for pairs of means follow from Result 5A.2 with $c^2 = \chi_p^2(\alpha)$. The overall confidence level of approximately $1 - \alpha$ for all statements is, once again, a result of the large sample distribution theory summarized in (5-31).


The question of what is a large sample size is not easy to answer. In one or two dimensions, sample sizes in the range 30 to 50 can usually be considered large. As the number of characteristics becomes large, certainly larger sample sizes are required for the asymptotic distributions to provide good approximations to the true distributions of various test statistics. Lacking definitive studies, we simply state that $n - p$ must be large and realize that the true case is more complicated. An application with $p = 2$ and sample size 50 is much different than an application with $p = 52$ and sample size 100, although both have $n - p = 48$.

It is good statistical practice to subject these large sample inference procedures to the same checks required of the normal-theory methods. Although small to moderate departures from normality do not cause any difficulties for n large, extreme deviations could cause problems. Specifically, the true error rate may be far removed from the nominal level $\alpha$. If, on the basis of Q-Q plots and other investigative devices, outliers and other forms of extreme departures are indicated (see, for example, [2]), appropriate corrective actions, including transformations, are desirable. Methods for testing mean vectors of symmetric multivariate distributions that are relatively insensitive to departures from normality are discussed in [11]. In some instances, Results 5.4 and 5.5 are useful only for very large samples.

The next example allows us to illustrate the construction of large sample simul- 
taneous statements for all single mean components. 


Example 5.7 (Constructing large sample simultaneous confidence intervals) A music 
educator tested thousands of Finnish students on their native musical ability in order 
to set national norms in Finland. Summary statistics for part of the data set are given 
in Table 5.5. These statistics are based on a sample of n = 96 Finnish 12th graders. 


Table 5.5 Musical Aptitude Profile Means and Standard Deviations for 96
12th-Grade Finnish Students Participating in a Standardization Program

                        Raw score
Variable          Mean ($\bar{x}_i$)   Standard deviation ($\sqrt{s_{ii}}$)
X1 = melody            28.1                 5.76
X2 = harmony           26.6                 5.85
X3 = tempo             35.4                 3.82
X4 = meter             34.2                 5.12
X5 = phrasing          23.6                 3.76
X6 = balance           22.0                 3.93
X7 = style             22.7                 4.03

Source: Data courtesy of V. Sell.




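Let us construct 90% simultaneous confidence intervals for the individual mean components $\mu_i$, $i = 1, 2, \ldots, 7$.

From Result 5.5, simultaneous 90% confidence limits are given by $\bar{x}_i \pm \sqrt{\chi_7^2(.10)}\sqrt{s_{ii}/n}$, $i = 1, 2, \ldots, 7$, where $\chi_7^2(.10) = 12.02$. Thus, with approximately 90% confidence,

$28.1 \pm \sqrt{12.02}\,\dfrac{5.76}{\sqrt{96}}$ contains $\mu_1$ or $26.06 \le \mu_1 \le 30.14$

$26.6 \pm \sqrt{12.02}\,\dfrac{5.85}{\sqrt{96}}$ contains $\mu_2$ or $24.53 \le \mu_2 \le 28.67$

$35.4 \pm \sqrt{12.02}\,\dfrac{3.82}{\sqrt{96}}$ contains $\mu_3$ or $34.05 \le \mu_3 \le 36.75$

$34.2 \pm \sqrt{12.02}\,\dfrac{5.12}{\sqrt{96}}$ contains $\mu_4$ or $32.39 \le \mu_4 \le 36.01$

$23.6 \pm \sqrt{12.02}\,\dfrac{3.76}{\sqrt{96}}$ contains $\mu_5$ or $22.27 \le \mu_5 \le 24.93$

$22.0 \pm \sqrt{12.02}\,\dfrac{3.93}{\sqrt{96}}$ contains $\mu_6$ or $20.61 \le \mu_6 \le 23.39$

$22.7 \pm \sqrt{12.02}\,\dfrac{4.03}{\sqrt{96}}$ contains $\mu_7$ or $21.27 \le \mu_7 \le 24.13$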


Based, perhaps, upon thousands of American students, the investigator could hypothesize the musical aptitude profile to be

$$\boldsymbol{\mu}_0' = [31, 27, 34, 31, 23, 22, 22]$$

We see from the simultaneous statements above that the melody, tempo, and meter components of $\boldsymbol{\mu}_0$ do not appear to be plausible values for the corresponding means of Finnish scores.
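The interval calculations of Example 5.7 can be sketched in code as follows. The critical value 12.02 is taken from the example. The standard deviations for balance and style (3.93 and 4.03) are not legible in every rendering of the example and are treated here as assumptions that reproduce the printed intervals; all names are illustrative.

```python
import math

n = 96
chi2_crit = 12.02  # chi-square_7(.10), as quoted in Example 5.7

names = ["melody", "harmony", "tempo", "meter", "phrasing", "balance", "style"]
means = [28.1, 26.6, 35.4, 34.2, 23.6, 22.0, 22.7]
# Last two standard deviations are assumptions consistent with the intervals.
sds = [5.76, 5.85, 3.82, 5.12, 3.76, 3.93, 4.03]
mu0 = [31, 27, 34, 31, 23, 22, 22]  # hypothesized profile

multiplier = math.sqrt(chi2_crit) / math.sqrt(n)
intervals = [(m - multiplier * s, m + multiplier * s) for m, s in zip(means, sds)]

# Components of mu0 that fall outside their simultaneous interval.
implausible = [
    name
    for name, (lo, hi), target in zip(names, intervals, mu0)
    if not lo <= target <= hi
]

for name, (lo, hi) in zip(names, intervals):
    print(f"{name:9s} {lo:6.2f} {hi:6.2f}")
print(implausible)  # ['melody', 'tempo', 'meter']
```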


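When the sample size is large, the one-at-a-time confidence intervals for individual means are

$$\bar{x}_i - z\!\left(\frac{\alpha}{2}\right)\sqrt{\frac{s_{ii}}{n}} \le \mu_i \le \bar{x}_i + z\!\left(\frac{\alpha}{2}\right)\sqrt{\frac{s_{ii}}{n}} \qquad i = 1, 2, \ldots, p$$

where $z(\alpha/2)$ is the upper $100(\alpha/2)$th percentile of the standard normal distribution. The Bonferroni simultaneous confidence intervals for the $m = p$ statements about the individual means take the same form, but use the modified percentile $z(\alpha/2p)$ to give

$$\bar{x}_i - z\!\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{ii}}{n}} \le \mu_i \le \bar{x}_i + z\!\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{ii}}{n}} \qquad i = 1, 2, \ldots, p$$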
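The two normal percentiles can be evaluated with the standard library. The sketch below uses $\alpha = .05$ and $p = 7$, matching the musical aptitude data, and recomputes the one-at-a-time interval for harmony (mean 26.6, standard deviation 5.85, $n = 96$) as a check; the helper name `upper_z` is an illustrative choice.

```python
import math
from statistics import NormalDist


def upper_z(tail_area: float) -> float:
    """Upper percentile z(tail_area) of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - tail_area)


alpha, p, n = 0.05, 7, 96
z_single = upper_z(alpha / 2)            # one-at-a-time percentile
z_bonferroni = upper_z(alpha / (2 * p))  # Bonferroni percentile
print(round(z_single, 2), round(z_bonferroni, 2))  # 1.96 2.69

# One-at-a-time interval for harmony.
half_width = z_single * 5.85 / math.sqrt(n)
harmony = (26.6 - half_width, 26.6 + half_width)
print(round(harmony[0], 2), round(harmony[1], 2))  # 25.43 27.77
```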




Table 5.6 gives the individual, Bonferroni, and chi-square-based (or shadow of the confidence ellipsoid) intervals for the musical aptitude data in Example 5.7.


Table 5.6 The Large Sample 95% Individual, Bonferroni, and $T^2$-Intervals for
the Musical Aptitude Data

The one-at-a-time confidence intervals use $z(.025) = 1.96$.
The simultaneous Bonferroni intervals use $z(.025/7) = 2.69$.
The simultaneous $T^2$, or shadows of the ellipsoid, use $\chi_7^2(.05) = 14.07$.

                   One-at-a-time     Bonferroni Intervals   Shadow of Ellipsoid
Variable           Lower    Upper    Lower    Upper         Lower    Upper
X1 = melody
X2 = harmony       25.43    27.77
X3 = tempo         34.64    36.16
X4 = meter         33.18    35.22
X5 = phrasing      22.85    24.35
X6 = balance       21.21    22.79
X7 = style         21.89    23.51


Although the sample size may be large, some statisticians prefer to retain the F- and t-based percentiles rather than use the chi-square or standard normal-based percentiles. The latter constants are the infinite sample size limits of the former constants. The F and t percentiles produce larger intervals and, hence, are more conservative. Table 5.7 gives the individual, Bonferroni, and F-based, or shadow of the confidence ellipsoid, intervals for the musical aptitude data. Comparing Table 5.7 with Table 5.6, we see that all of the intervals in Table 5.7 are larger. However, with the relatively large sample size $n = 96$, the differences are typically in the third, or tenths, digit.


Table 5.7 The 95% Individual, Bonferroni, and $T^2$-Intervals for the
Musical Aptitude Data

The one-at-a-time confidence intervals use $t_{95}(.025) = 1.99$.
The simultaneous Bonferroni intervals use $t_{95}(.025/7) = 2.75$.
The simultaneous $T^2$, or shadows of the ellipsoid, use $F_{7,89}(.05) = 2.11$.

                   One-at-a-time     Bonferroni Intervals   Shadow of Ellipsoid
Variable           Lower    Upper    Lower    Upper         Lower    Upper
X1 = melody                          26.48    29.72         25.76    30.44
X2 = harmony       25.41    27.79
X3 = tempo         34.63    36.17
X4 = meter         33.16    35.24
X5 = phrasing      22.84    24.36
X6 = balance       21.20    22.80
X7 = style         21.88    23.52




5.6 Multivariate Quality Control Charts 


To improve the quality of goods and services, data need to be examined for causes 
of variation. When a manufacturing process is continuously producing items or 
when we are monitoring activities of a service, data should be collected to evaluate 
the capabilities and stability of the process. When a process is stable, the variation is 
produced by common causes that are always present, and no one cause is a major 
source of variation. 

The purpose of any control chart is to identify occurrences of special causes of 
variation that come from outside of the usual process. These causes of variation 
often indicate a need for a timely repair, but they can also suggest improvements to 
the process. Control charts make the variation visible and allow one to distinguish 
common from special causes of variation. 

A control chart typically consists of data plotted in time order and horizontal lines, called control limits, that indicate the amount of variation due to common causes. One useful control chart is the $\bar{X}$-chart (read X-bar chart). To create an $\bar{X}$-chart,

1. Plot the individual observations or sample means in time order.
2. Create and plot the centerline $\bar{\bar{x}}$, the sample mean of all of the observations.
3. Calculate and plot the control limits given by

Upper control limit (UCL) = $\bar{\bar{x}}$ + 3(standard deviation)

Lower control limit (LCL) = $\bar{\bar{x}}$ - 3(standard deviation)

The standard deviation in the control limits is the estimated standard deviation of the observations being plotted. For single observations, it is often the sample standard deviation. If the means of subsamples of size m are plotted, then the standard deviation is the sample standard deviation divided by $\sqrt{m}$. The control limits of plus and minus three standard deviations are chosen so that there is a very small chance, assuming normally distributed data, of falsely signaling an out-of-control observation, that is, an observation suggesting a special cause of variation.


Example 5.8 (Creating a univariate control chart) The Madison, Wisconsin, police 
department regularly monitors many of its activities as part of an ongoing quality 
improvement program. Table 5.8 gives the data on five different kinds of over- 
time hours. Each observation represents a total for 12 pay periods, or about half 
a year. 

We examine the stability of the legal appearances overtime hours. A computer calculation gives $\bar{x}_1 = 3558$. Since individual values will be plotted, $\bar{x}_1$ is the same as $\bar{\bar{x}}_1$. Also, the sample standard deviation is $\sqrt{s_{11}} = 607$, and the control limits are

UCL = $\bar{\bar{x}}_1 + 3(\sqrt{s_{11}}) = 3558 + 3(607) = 5379$
LCL = $\bar{\bar{x}}_1 - 3(\sqrt{s_{11}}) = 3558 - 3(607) = 1737$
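The three-sigma limits can be computed either from summary statistics, as in Example 5.8, or directly from the plotted observations. The sketch below shows both; the function names are illustrative, and the short data list is hypothetical and serves only to exercise the second function.

```python
import math
import statistics


def limits_from_summary(center, sd, m=1):
    """(LCL, UCL) given a centerline and the sample standard deviation.

    If means of subsamples of size m are plotted, sd is divided by sqrt(m).
    """
    plotted_sd = sd / math.sqrt(m)
    return center - 3 * plotted_sd, center + 3 * plotted_sd


def individual_chart_limits(observations):
    """(LCL, centerline, UCL) for a chart of individual observations."""
    center = statistics.fmean(observations)
    sd = statistics.stdev(observations)
    return center - 3 * sd, center, center + 3 * sd


# Summary values from Example 5.8 (legal appearances overtime hours).
print(limits_from_summary(3558, 607))  # (1737.0, 5379.0)

# Hypothetical individual observations.
print(individual_chart_limits([10, 12, 11, 13, 9]))
```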




Table 5.8 Five Types of Overtime Hours for the Madison, Wisconsin, Police
Department

x1                  x2               x3          x4        x5
Legal Appearances   Extraordinary    Holdover    COA¹      Meeting
Hours               Event Hours      Hours       Hours     Hours
                    2200             1181        14,861     236
                     875             3532        11,367     310
                     957             2502        13,329    1182
                    1758             4510        12,328    1208
                     868             3032        12,847    1385
                     398             2130        13,979    1053
                    1603             1982        13,528    1046
                     523             4675        12,699    1100
                    2034             2354        13,534    1349
                    1136             4606        11,609    1150
                    5326             3044        14,189    1216
                    1658             3340        15,052     660
                    1945             2111        12,236     299
                     344             1291        15,482     206
                     807             1365        14,900
                                                 15,078

¹ Compensatory overtime allowed.


The data, along with the centerline and control limits, are plotted as an $\bar{X}$-chart in Figure 5.5.

Figure 5.5 The $\bar{X}$-chart for $x_1$ = legal appearances overtime hours (UCL = 5379, centerline $\bar{x}_1$ = 3558, LCL = 1737; horizontal axis: observation number).




The legal appearances overtime hours are stable over the period in which the data were collected. The variation in overtime hours appears to be due to common causes, so no special-cause variation is indicated.


With more than one important characteristic, a multivariate approach should be 
used to monitor process stability. Such an approach can account for correlations 
between characteristics and will control the overall probability of falsely signaling a 
special cause of variation when one is not present. High correlations among the 
variables can make it impossible to assess the overall error rate that is implied by a 
large number of univariate charts. 

The two most common multivariate charts are (i) the ellipse format chart and 
(ii) the T?-chart. 

Two cases that arise in practice need to be treated differently: 


1. Monitoring the stability of a given sample of multivariate observations 
2. Setting a control region for future observations 


Initially, we consider the use of multivariate control procedures for a sample of multivariate observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$. Later, we discuss these procedures when the observations are subgroup means.


Charts for Monitoring a Sample of Individual Multivariate 
Observations for Stability 


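We assume that $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ are independently distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. By Result 4.8,

$$\mathbf{X}_j - \bar{\mathbf{X}} = \left(1 - \frac{1}{n}\right)\mathbf{X}_j - \frac{1}{n}\mathbf{X}_1 - \cdots - \frac{1}{n}\mathbf{X}_{j-1} - \frac{1}{n}\mathbf{X}_{j+1} - \cdots - \frac{1}{n}\mathbf{X}_n$$

has

$$E(\mathbf{X}_j - \bar{\mathbf{X}}) = \mathbf{0} = (1 - n^{-1})\boldsymbol{\mu} - (n-1)n^{-1}\boldsymbol{\mu}$$

and

$$\mathrm{Cov}(\mathbf{X}_j - \bar{\mathbf{X}}) = \left(1 - \frac{1}{n}\right)^2\boldsymbol{\Sigma} + (n-1)n^{-2}\boldsymbol{\Sigma} = \frac{(n-1)}{n}\boldsymbol{\Sigma}$$

Each $\mathbf{X}_j - \bar{\mathbf{X}}$ has a normal distribution, but $\mathbf{X}_j - \bar{\mathbf{X}}$ is not independent of the sample covariance matrix $\mathbf{S}$. However, to set control limits, we approximate that $(\mathbf{X}_j - \bar{\mathbf{X}})'\mathbf{S}^{-1}(\mathbf{X}_j - \bar{\mathbf{X}})$ has a chi-square distribution.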


Ellipse Format Chart. The ellipse format chart for a bivariate control region is the more intuitive of the charts, but its approach is limited to two variables. The two characteristics on the jth unit are plotted as a pair $(x_{j1}, x_{j2})$. The 95% quality ellipse consists of all $\mathbf{x}$ that satisfy

$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le \chi_2^2(.05) \tag{5-32}$$




Example 5.9 (An ellipse format chart for overtime hours) Let us refer to Example 5.8 and create a quality ellipse for the pair of overtime characteristics (legal appearances, extraordinary event) hours. A computer calculation gives

$$\bar{\mathbf{x}} = \begin{bmatrix} 3558 \\ 1478 \end{bmatrix} \quad\text{and}\quad \mathbf{S} = \begin{bmatrix} 367{,}884.7 & -72{,}093.8 \\ -72{,}093.8 & 1{,}399{,}053.1 \end{bmatrix}$$


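We illustrate the quality ellipse format chart using the 99% ellipse, which consists of all $\mathbf{x}$ that satisfy

$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le \chi_2^2(.01)$$

Here $p = 2$, so $\chi_2^2(.01) = 9.21$, and the ellipse becomes

$$\frac{s_{11}s_{22}}{s_{11}s_{22} - s_{12}^2}\left(\frac{(x_1 - \bar{x}_1)^2}{s_{11}} - 2s_{12}\frac{(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)}{s_{11}s_{22}} + \frac{(x_2 - \bar{x}_2)^2}{s_{22}}\right)$$

$$= \frac{(367884.7 \times 1399053.1)}{367884.7 \times 1399053.1 - (-72093.8)^2} \times \left(\frac{(x_1 - 3558)^2}{367884.7} - 2(-72093.8)\frac{(x_1 - 3558)(x_2 - 1478)}{367884.7 \times 1399053.1} + \frac{(x_2 - 1478)^2}{1399053.1}\right) \le 9.21$$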


This ellipse format chart is graphed, along with the pairs of data, in Figure 5.6.

Figure 5.6 The quality control 99% ellipse for legal appearances and extraordinary event overtime (horizontal axis: appearances overtime; vertical axis: extraordinary event overtime).
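Whether a given pair falls inside the 99% quality ellipse can be checked by evaluating the quadratic form directly. The sketch below uses the mean vector and covariance matrix of Example 5.9 and the critical value 9.21. The test point is hypothetical: it pairs the mean of legal appearances hours with an extraordinary event value of 5326, and is not claimed to be an actual observation.

```python
def quadratic_form_2x2(x, center, cov):
    """(x - center)' cov^{-1} (x - center) for a symmetric 2 x 2 cov."""
    (s11, s12), (_, s22) = cov
    d1, d2 = x[0] - center[0], x[1] - center[1]
    det = s11 * s22 - s12 ** 2
    return (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det


center = (3558.0, 1478.0)
cov = ((367884.7, -72093.8), (-72093.8, 1399053.1))
chi2_crit = 9.21  # chi-square_2(.01), as quoted in Example 5.9

hypothetical_point = (3558.0, 5326.0)
d2 = quadratic_form_2x2(hypothetical_point, center, cov)
print(round(d2, 2), d2 > chi2_crit)  # a value above 9.21 lies outside the ellipse
```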


Figure 5.7 The $\bar{X}$-chart for $x_2$ = extraordinary event hours (UCL = 5027, centerline $\bar{x}_2$ = 1478, LCL = -2071; horizontal axis: observation number).


Notice that one point, indicated with an arrow, is definitely outside of the ellipse. When a point is out of the control region, individual $\bar{X}$ charts are constructed. The $\bar{X}$-chart for $x_1$ was given in Figure 5.5; that for $x_2$ is given in Figure 5.7.

When the lower control limit is less than zero for data that must be nonnegative, it is generally set to zero. The LCL = 0 limit is shown by the dashed line in Figure 5.7.

Was there a special cause of the single point for extraordinary event overtime that is outside the upper control limit in Figure 5.7? During this period, the United States bombed a foreign capital, and students at Madison were protesting. A majority of the extraordinary overtime was used in that four-week period. Although, by its very definition, extraordinary overtime occurs only when special events occur and is therefore unpredictable, it still has a certain stability.


$T^2$-Chart. A $T^2$-chart can be applied to a large number of characteristics. Unlike the ellipse format, it is not limited to two variables. Moreover, the points are displayed in time order rather than as a scatter plot, and this makes patterns and trends visible.

For the jth point, we calculate the $T^2$-statistic

$$T_j^2 = (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}}) \tag{5-33}$$

We then plot the $T^2$-values on a time axis. The lower control limit is zero, and we use the upper control limit

$$\text{UCL} = \chi_p^2(.05)$$

or, sometimes, $\chi_p^2(.01)$.

There is no centerline in the $T^2$-chart. Notice that the $T^2$-statistic is the same as the quantity $d_j^2$ used to test normality in Section 4.6.
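A minimal sketch of the $T^2$-chart calculation in (5-33) for two variables, using only the standard library. The data are hypothetical (nine clustered pairs and one unusual pair), and the limit 5.99 is the standard table value of $\chi_2^2(.05)$, typed in as a constant.

```python
import statistics


def t2_values(data):
    """T_j^2 = (x_j - xbar)' S^{-1} (x_j - xbar) for bivariate data."""
    n = len(data)
    xs = [row[0] for row in data]
    ys = [row[1] for row in data]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    s11 = statistics.variance(xs)
    s22 = statistics.variance(ys)
    s12 = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = s11 * s22 - s12 ** 2
    return [
        (s22 * (x - mx) ** 2 - 2 * s12 * (x - mx) * (y - my) + s11 * (y - my) ** 2) / det
        for x, y in data
    ]


# Hypothetical observations in time order; the last pair is unusual.
data = [(10, 20), (11, 21), (9, 19), (10, 21), (11, 20),
        (9, 20), (10, 19), (11, 19), (9, 21), (25, 5)]
ucl = 5.99  # chi-square_2(.05), standard table value

t2 = t2_values(data)
out_of_control = [j + 1 for j, value in enumerate(t2) if value > ucl]
print([round(v, 2) for v in t2])
print(out_of_control)  # [10]
```

A useful sanity check on any such computation is that the $T_j^2$ values sum to $(n-1)p$, here 18, because $\sum_j (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' = (n-1)\mathbf{S}$.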




Example 5.10 (A $T^2$-chart for overtime hours) Using the police department data in Example 5.8, we construct a $T^2$-plot based on the two variables $X_1$ = legal appearances hours and $X_2$ = extraordinary event hours. $T^2$-charts with more than two variables are considered in Exercise 5.26. We take $\alpha = .01$ to be consistent with the ellipse format chart in Example 5.9.

The $T^2$-chart in Figure 5.8 reveals that the pair (legal appearances, extraordinary event) hours for period 11 is out of control. Further investigation, as in Example 5.9, confirms that this is due to the large value of extraordinary event overtime during that period.

Figure 5.8 The $T^2$-chart for legal appearances hours and extraordinary event hours, $\alpha = .01$ (horizontal axis: period).


When the multivariate $T^2$-chart signals that the jth unit is out of control, it should be determined which variables are responsible. A modified region based on Bonferroni intervals is frequently chosen for this purpose. The kth variable is out of control if $x_{jk}$ does not lie in the interval

$$\left(\bar{x}_k - t_{n-1}(.005/p)\sqrt{s_{kk}},\;\; \bar{x}_k + t_{n-1}(.005/p)\sqrt{s_{kk}}\right)$$

where p is the total number of measured variables.
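The variable-by-variable check can be sketched as below. The critical value is passed in by the caller, since it must be looked up as $t_{n-1}(.005/p)$ for the sample at hand; the value 3.0 and all of the data in the usage example are hypothetical placeholders.

```python
import math


def out_of_control_variables(x_j, means, variances, t_crit):
    """Return the 1-based indices k for which x_jk lies outside
    (xbar_k - t_crit * sqrt(s_kk), xbar_k + t_crit * sqrt(s_kk))."""
    flagged = []
    for k, (value, mean, var) in enumerate(zip(x_j, means, variances), start=1):
        half_width = t_crit * math.sqrt(var)
        if not (mean - half_width < value < mean + half_width):
            flagged.append(k)
    return flagged


# Hypothetical unit with p = 2 variables.
flagged = out_of_control_variables(
    x_j=[11.0, 70.0],
    means=[10.0, 50.0],
    variances=[4.0, 25.0],
    t_crit=3.0,  # placeholder; use t_{n-1}(.005/p) in practice
)
print(flagged)  # [2]
```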


Example 5.11 (Control of robotic welders: more than $T^2$ needed) The assembly of a driveshaft for an automobile requires the circle welding of tube yokes to a tube. The inputs to the automated welding machines must be controlled to be within certain operating limits where a machine produces welds of good quality. In order to control the process, one process engineer measured four critical variables:

$X_1$ = Voltage (volts)

$X_2$ = Current (amps)

$X_3$ = Feed speed (in/min)

$X_4$ = (inert) Gas flow (cfm)




Table 5.9 gives the values of these variables at five-second intervals.


Table 5.9 Welder Data

Case   Voltage (X1)   Current (X2)   Feed speed (X3)   Gas flow (X4)
  1       23.0           276            289.6             51.0
  2       22.0           281            289.0             51.7
  3       22.8           270            288.2             51.3
  4       22.1           278            288.0             52.3
  5       22.5           275            288.0             53.0
  6       22.2           273            288.0             51.0
  7       22.0           275            290.0             53.0
  8       22.1           268            289.0             54.0
  9       22.5           277            289.0             52.0
 10       22.5           278            289.0             52.0
 11       22.3           269            287.0             54.0
 12       21.8           274            287.6             52.0
 13       22.3           270            288.4             51.0
 14       22.2           273            290.2             51.3
 15       22.1           274            286.0             51.0
 16       22.1           277            287.0             52.0
 17       21.8           277            287.0             51.0
 18       22.6           276            290.0             51.0
 19       22.3           278            287.0             51.7
 20       23.0           266            289.1             51.0
 21       22.9           271            288.3             51.0
 22       21.3           274            289.0             52.0
 23       21.8           280            290.0             52.0
 24       22.0           268            288.3             51.0
 25       22.8           269            288.7             52.0
 26       22.0           264            290.0             51.0
 27       22.5           273            288.6             52.0
 28       22.2           269            288.2             52.0
 29       22.6           273            286.0             52.0
 30       21.7           283            290.0             52.7
 31       21.9           273            288.7             55.3
 32       22.3           264            287.0             52.0
 33       22.2           263            288.0             52.0
 34       22.3           266            288.6             51.7
 35       22.0           263            288.0             51.7
 36       22.8           272            289.0             52.3
 37       22.0           277            287.7             53.3
 38       22.7           272            289.0             52.0
 39       22.6           274            287.2             52.7
 40       22.7           270            290.0             51.0

Source: Data courtesy of Mark Abbotoy.




The normal assumption is reasonable for most variables, but we take the natural logarithm of gas flow. In addition, there is no appreciable serial correlation for successive observations on each variable.

A $T^2$-chart for the four welding variables is given in Figure 5.9. The dotted line is the 95% limit and the solid line is the 99% limit. Using the 99% limit, no points are out of control, but case 31 is outside the 95% limit.

What do the quality control ellipses (ellipse format charts) show for two variables? Most of the variables are in control. However, the 99% quality ellipse for gas flow and voltage, shown in Figure 5.10, reveals that case 31 is out of control and this is due to an unusually large volume of gas flow. The univariate $\bar{X}$ chart for ln(gas flow), in Figure 5.11, shows that this point is outside the three sigma limits. It appears that gas flow was reset at the target for case 32. All the other univariate $\bar{X}$-charts have all points within their three sigma control limits.


Figure 5.9 The $T^2$-chart for the welding data with 95% and 99% limits (horizontal axis: case).

Figure 5.10 The 99% quality control ellipse for ln(gas flow) and voltage.

Figure 5.11 The univariate $\bar{X}$-chart for ln(gas flow) (UCL = 4.005, mean = 3.951, LCL = 3.896; horizontal axis: case).

In this example, a shift in a single variable was masked with 99% limits, or almost masked (with 95% limits), by being combined into a single $T^2$-value.


Control Regions for Future Individual Observations 


The goal now is to use data $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, collected when a process is stable, to set a control region for a future observation $\mathbf{x}$ or future observations. The region in which a future observation is expected to lie is called a forecast, or prediction, region. If the process is stable, we take the observations to be independently distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Because these regions are of more general importance than just for monitoring quality, we give the basic distribution theory as Result 5.6.


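Result 5.6. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be independently distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, and let $\mathbf{X}$ be a future observation from the same distribution. Then

$$T^2 = \frac{n}{n+1}(\mathbf{X} - \bar{\mathbf{X}})'\mathbf{S}^{-1}(\mathbf{X} - \bar{\mathbf{X}}) \quad\text{is distributed as}\quad \frac{(n-1)p}{n-p}F_{p,n-p}$$

and a $100(1 - \alpha)\%$ p-dimensional prediction ellipsoid is given by all $\mathbf{x}$ satisfying

$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le \frac{(n^2-1)p}{n(n-p)}F_{p,n-p}(\alpha)$$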


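Proof. We first note that $\mathbf{X} - \bar{\mathbf{X}}$ has mean $\mathbf{0}$. Since $\mathbf{X}$ is a future observation, $\mathbf{X}$ and $\bar{\mathbf{X}}$ are independent, so

$$\mathrm{Cov}(\mathbf{X} - \bar{\mathbf{X}}) = \mathrm{Cov}(\mathbf{X}) + \mathrm{Cov}(\bar{\mathbf{X}}) = \boldsymbol{\Sigma} + \frac{1}{n}\boldsymbol{\Sigma} = \frac{(n+1)}{n}\boldsymbol{\Sigma}$$

and, by Result 4.8, $\sqrt{n/(n+1)}\,(\mathbf{X} - \bar{\mathbf{X}})$ is distributed as $N_p(\mathbf{0}, \boldsymbol{\Sigma})$. Now,

$$\sqrt{\frac{n}{n+1}}\,(\mathbf{X} - \bar{\mathbf{X}})'\,\mathbf{S}^{-1}\,\sqrt{\frac{n}{n+1}}\,(\mathbf{X} - \bar{\mathbf{X}})$$

which combines a multivariate normal, $N_p(\mathbf{0}, \boldsymbol{\Sigma})$, random vector and an independent Wishart, $W_{p,n-1}(\boldsymbol{\Sigma})$, random matrix in the form

$$\left(\begin{array}{c}\text{multivariate normal}\\ \text{random vector}\end{array}\right)'\left(\frac{\text{Wishart random matrix}}{\text{d.f.}}\right)^{-1}\left(\begin{array}{c}\text{multivariate normal}\\ \text{random vector}\end{array}\right)$$

has the scaled F distribution claimed according to (5-8) and the discussion on page 213.

The constant for the ellipsoid follows from (5-6).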


Note that the prediction region in Result 5.6 for a future observed value $\mathbf{x}$ is an ellipsoid. It is centered at the initial sample mean $\bar{\mathbf{x}}$, and its axes are determined by the eigenvectors of $\mathbf{S}$. Since

$$P\!\left[(\mathbf{X} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{X} - \bar{\mathbf{x}}) \le \frac{(n^2-1)p}{n(n-p)}F_{p,n-p}(\alpha)\right] = 1 - \alpha$$

before any new observations are taken, the probability that $\mathbf{X}$ will fall in the prediction ellipse is $1 - \alpha$.

Keep in mind that the current observations must be stable before they can be 
used to determine control regions for future observations. 

Based on Result 5.6, we obtain the two charts for future observations. 


Control Ellipse for Future Observations 


With $p = 2$, the 95% prediction ellipse in Result 5.6 specializes to

$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le \frac{(n^2-1)2}{n(n-2)}F_{2,n-2}(.05) \tag{5-34}$$

Any future observation $\mathbf{x}$ is declared to be out of control if it falls out of the control ellipse.
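The right-hand side of (5-34) is a single number once n is fixed. The sketch below evaluates it for $n = 15$ stable observations; the critical value $F_{2,13}(.05) = 3.81$ is a standard table value entered by hand (an assumption of this sketch), and the function name is illustrative.

```python
def prediction_ellipse_bound(n: int, p: int, f_crit: float) -> float:
    """Right-hand side of the prediction ellipsoid in Result 5.6."""
    return (n * n - 1) * p / (n * (n - p)) * f_crit


n, p = 15, 2
f_crit = 3.81  # F_{2,13}(.05), standard table value (assumed)

bound = prediction_ellipse_bound(n, p, f_crit)
print(round(bound, 2))  # 8.75
```

With so few observations the bound is noticeably larger than the large-sample value $\chi_2^2(.05) = 5.99$, which reflects the extra uncertainty from estimating $\bar{\mathbf{x}}$ and $\mathbf{S}$.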


Example 5.12 (A control ellipse for future overtime hours) In Example 5.9, we 
checked the stability of legal appearances and extraordinary event overtime hours. 
Let’s use these data to determine a control region for future pairs of values. 

From Example 5.9 and Figure 5.6, we find that the pair of values for period 11 
were out of control. We removed this point and determined the new 99% ellipse. All 
of the points are then in control, so they can serve to determine the 95% prediction 
region just defined for p = 2. This control ellipse is shown in Figure 5.12 along with 
the initial 15 stable observations. 

Any future observation falling in the ellipse is regarded as stable or in control.
An observation outside of the ellipse represents a potential out-of-control observation or special-cause variation. ■


Figure 5.12 The 95% control ellipse for future legal appearances and extraordinary event overtime. (Horizontal axis: Appearances Overtime; vertical axis: Extraordinary Event Overtime.)


T²-Chart for Future Observations


For each new observation $\mathbf{x}$, plot

$$T^2 = \frac{n}{n+1}(\mathbf{x}-\bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}-\bar{\mathbf{x}})$$

in time order. Set LCL = 0, and take

$$\text{UCL} = \frac{(n-1)p}{(n-p)}F_{p,n-p}(.05)$$

Points above the upper control limit represent potential special cause variation
and suggest that the process in question should be examined to determine
whether immediate corrective action is warranted. See [9] for discussion of other
procedures.


Control Charts Based on Subsample Means 


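It is assumed that each random vector of observations from the process is independently distributed as $N_p(\mathbf{0},\boldsymbol{\Sigma})$. We proceed differently when the sampling procedure specifies that $m > 1$ units be selected, at the same time, from the process. From the first sample, we determine its sample mean $\bar{\mathbf{X}}_1$ and covariance matrix $\mathbf{S}_1$. When the population is normal, these two random quantities are independent.

For a general subsample mean $\bar{\mathbf{X}}_j$, $\bar{\mathbf{X}}_j-\bar{\bar{\mathbf{X}}}$ has a normal distribution with mean $\mathbf{0}$ and

$$\operatorname{Cov}(\bar{\mathbf{X}}_j-\bar{\bar{\mathbf{X}}}) = \left(1-\frac{1}{n}\right)^2\operatorname{Cov}(\bar{\mathbf{X}}_j) + \frac{n-1}{n^2}\operatorname{Cov}(\bar{\mathbf{X}}_1) = \frac{(n-1)}{nm}\boldsymbol{\Sigma}$$

where

$$\bar{\bar{\mathbf{X}}} = \frac{1}{n}\sum_{j=1}^{n}\bar{\mathbf{X}}_j$$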


As will be described in Section 6.4, the sample covariances from the $n$ subsamples can be combined to give a single estimate (called $\mathbf{S}_{\text{pooled}}$ in Chapter 6) of the common covariance $\boldsymbol{\Sigma}$. This pooled estimate is

$$\mathbf{S} = \frac{1}{n}(\mathbf{S}_1+\mathbf{S}_2+\cdots+\mathbf{S}_n)$$

Here $(nm-n)\mathbf{S}$ is independent of each $\bar{\mathbf{X}}_j$ and, therefore, of their mean $\bar{\bar{\mathbf{X}}}$. Further, $(nm-n)\mathbf{S}$ is distributed as a Wishart random matrix with $nm-n$ degrees of freedom. Notice that we are estimating $\boldsymbol{\Sigma}$ internally from the data collected in any given period. These estimators are combined to give a single estimator with a large number of degrees of freedom. Consequently,

$$T^2 = \frac{nm}{n-1}(\bar{\mathbf{X}}_j-\bar{\bar{\mathbf{X}}})'\mathbf{S}^{-1}(\bar{\mathbf{X}}_j-\bar{\bar{\mathbf{X}}}) \tag{5-35}$$

is distributed as

$$\frac{(nm-n)p}{(nm-n-p+1)}F_{p,\,nm-n-p+1}$$


Ellipse Format Chart. In an analogous fashion to our discussion on individual multivariate observations, the ellipse format chart for pairs of subsample means is

$$(\bar{\mathbf{x}}-\bar{\bar{\mathbf{x}}})'\mathbf{S}^{-1}(\bar{\mathbf{x}}-\bar{\bar{\mathbf{x}}}) \le \frac{(n-1)(m-1)2}{m(nm-n-1)}F_{2,\,nm-n-1}(.05) \tag{5-36}$$

although the right-hand side is usually approximated as $\chi_2^2(.05)/m$.

Subsamples corresponding to points outside of the control ellipse should be
carefully checked for changes in the behavior of the quality characteristics being
measured. The interested reader is referred to [10] for additional discussion.


T²-Chart. To construct a $T^2$-chart with subsample data and $p$ characteristics, we plot the quantity

$$T_j^2 = m(\bar{\mathbf{X}}_j-\bar{\bar{\mathbf{X}}})'\mathbf{S}^{-1}(\bar{\mathbf{X}}_j-\bar{\bar{\mathbf{X}}})$$

for $j = 1, 2, \ldots, n$, where the

$$\text{UCL} = \frac{(n-1)(m-1)p}{(nm-n-p+1)}F_{p,\,nm-n-p+1}(.05)$$

The UCL is often approximated as $\chi_p^2(.05)$ when $n$ is large.

Values of $T_j^2$ that exceed the UCL correspond to potentially out-of-control or
special cause variation, which should be checked. (See [10].)
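The subsample computations above (pooled covariance, grand mean, $T_j^2$, and UCL) can be sketched in Python as follows. This is a minimal sketch using only the standard library; the helper names and the small data set are hypothetical, and the $F$ critical value is supplied by the caller from a table rather than computed.

```python
def mean_vec(rows):
    """Componentwise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def cov_mat(rows):
    """Sample covariance matrix (divisor m - 1) of one subsample."""
    m, xbar = len(rows), mean_vec(rows)
    p = len(xbar)
    return [[sum((r[i] - xbar[i]) * (r[j] - xbar[j]) for r in rows) / (m - 1)
             for j in range(p)] for i in range(p)]


def solve(A, b):
    """Solve A y = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(b)
    M = [list(A[i]) + [b[i]] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[k] for row in M]


def t2_chart(subsamples, f_crit):
    """Return (list of T_j^2, UCL) for n subsamples of m units each."""
    n, m = len(subsamples), len(subsamples[0])
    means = [mean_vec(s) for s in subsamples]
    p = len(means[0])
    grand = mean_vec(means)                      # mean of the subsample means
    covs = [cov_mat(s) for s in subsamples]
    S = [[sum(c[i][j] for c in covs) / n for j in range(p)] for i in range(p)]
    t2 = []
    for xb in means:
        d = [a - g for a, g in zip(xb, grand)]
        y = solve(S, d)                          # y = S^{-1} d
        t2.append(m * sum(di * yi for di, yi in zip(d, y)))
    ucl = (n - 1) * (m - 1) * p / (n * m - n - p + 1) * f_crit
    return t2, ucl


# Hypothetical data: n = 3 subsamples, m = 3 units each, p = 2 characteristics.
subsamples = [
    [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0)],
    [(2.0, 3.0), (3.0, 5.0), (4.0, 4.0)],
    [(1.0, 4.0), (3.0, 3.0), (2.0, 5.0)],
]
t2_values, ucl = t2_chart(subsamples, f_crit=5.79)  # assumed table value of F_{2,5}(.05)
print(t2_values, ucl)
```

In practice far more subsamples would be used; three are shown only to keep the arithmetic easy to follow by hand.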




Control Regions for Future Subsample Observations 


Once data are collected from the stable operation of a process, they can be used to 
set control limits for future observed subsample means. 

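If $\bar{\mathbf{X}}$ is a future subsample mean, then $\bar{\mathbf{X}}-\bar{\bar{\mathbf{X}}}$ has a multivariate normal distribution with mean $\mathbf{0}$ and

$$\operatorname{Cov}(\bar{\mathbf{X}}-\bar{\bar{\mathbf{X}}}) = \operatorname{Cov}(\bar{\mathbf{X}}) + \frac{1}{n}\operatorname{Cov}(\bar{\mathbf{X}}_1) = \frac{(n+1)}{nm}\boldsymbol{\Sigma}$$

Consequently,

$$\frac{nm}{n+1}(\bar{\mathbf{X}}-\bar{\bar{\mathbf{X}}})'\mathbf{S}^{-1}(\bar{\mathbf{X}}-\bar{\bar{\mathbf{X}}})$$

is distributed as

$$\frac{(nm-n)p}{(nm-n-p+1)}F_{p,\,nm-n-p+1}$$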


Control Ellipse for Future Subsample Means. The prediction ellipse for a future subsample mean for $p = 2$ characteristics is defined by the set of all $\bar{\mathbf{x}}$ such that

$$(\bar{\mathbf{x}}-\bar{\bar{\mathbf{x}}})'\mathbf{S}^{-1}(\bar{\mathbf{x}}-\bar{\bar{\mathbf{x}}}) \le \frac{(n+1)(m-1)2}{m(nm-n-1)}F_{2,\,nm-n-1}(.05) \tag{5-37}$$

where, again, the right-hand side is usually approximated as $\chi_2^2(.05)/m$.


T²-Chart for Future Subsample Means. As before, we bring $n/(n+1)$ into the control limit and plot the quantity

$$T^2 = m(\bar{\mathbf{X}}-\bar{\bar{\mathbf{X}}})'\mathbf{S}^{-1}(\bar{\mathbf{X}}-\bar{\bar{\mathbf{X}}})$$

for future sample means in chronological order. The upper control limit is then

$$\text{UCL} = \frac{(n+1)(m-1)p}{(nm-n-p+1)}F_{p,\,nm-n-p+1}(.05)$$

The UCL is often approximated as $\chi_p^2(.05)$ when $n$ is large.

Points outside of the prediction ellipse or above the UCL suggest that the cur- 
rent values of the quality characteristics are different in some way from those of the 
previous stable process. This may be good or bad, but almost certainly warrants a 
careful search for the reasons for the change. 


5.7 Inferences about Mean Vectors When Some 
Observations Are Missing 


Often, some components of a vector observation are unavailable. This may occur because of a breakdown in the recording equipment or because of the unwillingness of a respondent to answer a particular item on a survey questionnaire. The best way to handle incomplete observations, or missing values, depends, to a large extent, on the experimental context. If the pattern of missing values is closely tied to the value of the response, such as people with extremely high incomes who refuse to respond in a survey on salaries, subsequent inferences may be seriously biased. To date, no statistical techniques have been developed for these cases. However, we are able to treat situations where data are missing at random, that is, cases in which the chance mechanism responsible for the missing values is not influenced by the value of the variables.

A general approach for computing maximum likelihood estimates from incom- 
plete data is given by Dempster, Laird, and Rubin [5]. Their technique, called the 
EM algorithm, consists of an iterative calculation involving two steps. We call them 
the prediction and estimation steps: 


1. Prediction step. Given some estimate $\tilde{\boldsymbol{\theta}}$ of the unknown parameters, predict
the contribution of any missing observation to the (complete-data) sufficient
statistics.

2. Estimation step. Use the predicted sufficient statistics to compute a revised 
estimate of the parameters. 


The calculation cycles from one step to the other, until the revised estimates do 
not differ appreciably from the estimate obtained in the previous iteration. 

When the observations $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ are a random sample from a $p$-variate normal population, the prediction-estimation algorithm is based on the complete-data sufficient statistics [see (4-21)]

$$\mathbf{T}_1 = \sum_{j=1}^{n}\mathbf{X}_j = n\bar{\mathbf{X}}$$

and

$$\mathbf{T}_2 = \sum_{j=1}^{n}\mathbf{X}_j\mathbf{X}_j' = (n-1)\mathbf{S} + n\bar{\mathbf{X}}\bar{\mathbf{X}}'$$

In this case, the algorithm proceeds as follows: We assume that the population mean and variance, $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, respectively, are unknown and must be estimated.


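Prediction step. For each vector $\mathbf{x}_j$ with missing values, let $\mathbf{x}_j^{(1)}$ denote the missing components and $\mathbf{x}_j^{(2)}$ denote those components which are available. Thus, $\mathbf{x}_j' = [\mathbf{x}_j^{(1)\prime}, \mathbf{x}_j^{(2)\prime}]$.

Given estimates $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\Sigma}}$ from the estimation step, use the mean of the conditional normal distribution of $\mathbf{x}^{(1)}$, given $\mathbf{x}^{(2)}$, to estimate the missing values. That is,¹

$$\tilde{\mathbf{x}}_j^{(1)} = E(\mathbf{X}_j^{(1)} \mid \mathbf{x}_j^{(2)}; \tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}}) = \tilde{\boldsymbol{\mu}}^{(1)} + \tilde{\boldsymbol{\Sigma}}_{12}\tilde{\boldsymbol{\Sigma}}_{22}^{-1}(\mathbf{x}_j^{(2)} - \tilde{\boldsymbol{\mu}}^{(2)}) \tag{5-38}$$

estimates the contribution of $\mathbf{x}_j^{(1)}$ to $\mathbf{T}_1$.

Next, the predicted contribution of $\mathbf{x}_j^{(1)}$ to $\mathbf{T}_2$ is

$$\widetilde{\mathbf{x}_j^{(1)}\mathbf{x}_j^{(1)\prime}} = E(\mathbf{X}_j^{(1)}\mathbf{X}_j^{(1)\prime} \mid \mathbf{x}_j^{(2)}; \tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}}) = \tilde{\boldsymbol{\Sigma}}_{11} - \tilde{\boldsymbol{\Sigma}}_{12}\tilde{\boldsymbol{\Sigma}}_{22}^{-1}\tilde{\boldsymbol{\Sigma}}_{21} + \tilde{\mathbf{x}}_j^{(1)}\tilde{\mathbf{x}}_j^{(1)\prime} \tag{5-39}$$

and

$$\widetilde{\mathbf{x}_j^{(1)}\mathbf{x}_j^{(2)\prime}} = E(\mathbf{X}_j^{(1)}\mathbf{X}_j^{(2)\prime} \mid \mathbf{x}_j^{(2)}; \tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}}) = \tilde{\mathbf{x}}_j^{(1)}\mathbf{x}_j^{(2)\prime}$$

The contributions in (5-38) and (5-39) are summed over all $\mathbf{x}_j$ with missing components. The results are combined with the sample data to yield $\tilde{\mathbf{T}}_1$ and $\tilde{\mathbf{T}}_2$.

¹ If all the components of $\mathbf{x}_j$ are missing, set $\tilde{\mathbf{x}}_j = \tilde{\boldsymbol{\mu}}$ and $\widetilde{\mathbf{x}_j\mathbf{x}_j'} = \tilde{\boldsymbol{\Sigma}} + \tilde{\boldsymbol{\mu}}\tilde{\boldsymbol{\mu}}'$.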


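Estimation step. Compute the revised maximum likelihood estimates (see Result 4.11):

$$\tilde{\boldsymbol{\mu}} = \frac{\tilde{\mathbf{T}}_1}{n}, \qquad \tilde{\boldsymbol{\Sigma}} = \frac{1}{n}\tilde{\mathbf{T}}_2 - \tilde{\boldsymbol{\mu}}\tilde{\boldsymbol{\mu}}' \tag{5-40}$$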


We illustrate the computational aspects of the prediction—estimation algorithm 
in Example 5.13. 


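Example 5.13 (Illustrating the EM algorithm) Estimate the normal population mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ using the incomplete data set

$$\mathbf{X} = \begin{bmatrix} \text{-} & 0 & 3 \\ 7 & 2 & 6 \\ 5 & 1 & 2 \\ \text{-} & \text{-} & 5 \end{bmatrix}$$

Here $n = 4$, $p = 3$, and parts of observation vectors $\mathbf{x}_1$ and $\mathbf{x}_4$ are missing (marked by dashes).
We obtain the initial sample averages

$$\tilde{\mu}_1 = \frac{7+5}{2} = 6, \qquad \tilde{\mu}_2 = \frac{0+2+1}{3} = 1, \qquad \tilde{\mu}_3 = \frac{3+6+2+5}{4} = 4$$

from the available observations. Substituting these averages for any missing values, so that $\tilde{x}_{11} = 6$, for example, we can obtain initial covariance estimates. We shall construct these estimates using the divisor $n$ because the algorithm eventually produces the maximum likelihood estimate $\hat{\boldsymbol{\Sigma}}$. Thus,

$$\tilde{\sigma}_{11} = \frac{(6-6)^2+(7-6)^2+(5-6)^2+(6-6)^2}{4} = \frac{1}{2}$$

$$\tilde{\sigma}_{22} = \frac{1}{2}, \qquad \tilde{\sigma}_{33} = \frac{5}{2}$$

$$\tilde{\sigma}_{12} = \frac{(6-6)(0-1)+(7-6)(2-1)+(5-6)(1-1)+(6-6)(1-1)}{4} = \frac{1}{4}$$

$$\tilde{\sigma}_{23} = \frac{3}{4}, \qquad \tilde{\sigma}_{13} = 1$$

The prediction step consists of using the initial estimates $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\Sigma}}$ to predict the contributions of the missing values to the sufficient statistics $\mathbf{T}_1$ and $\mathbf{T}_2$. [See (5-38) and (5-39).]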




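The first component of $\mathbf{x}_1$ is missing, so we partition $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\Sigma}}$ as

$$\tilde{\boldsymbol{\mu}} = \left[\begin{array}{c}\tilde{\mu}_1 \\ \hline \tilde{\mu}_2 \\ \tilde{\mu}_3\end{array}\right] = \begin{bmatrix}\tilde{\boldsymbol{\mu}}^{(1)} \\ \tilde{\boldsymbol{\mu}}^{(2)}\end{bmatrix}, \qquad \tilde{\boldsymbol{\Sigma}} = \left[\begin{array}{c|cc}\tilde{\sigma}_{11} & \tilde{\sigma}_{12} & \tilde{\sigma}_{13} \\ \hline \tilde{\sigma}_{12} & \tilde{\sigma}_{22} & \tilde{\sigma}_{23} \\ \tilde{\sigma}_{13} & \tilde{\sigma}_{23} & \tilde{\sigma}_{33}\end{array}\right] = \begin{bmatrix}\tilde{\boldsymbol{\Sigma}}_{11} & \tilde{\boldsymbol{\Sigma}}_{12} \\ \tilde{\boldsymbol{\Sigma}}_{21} & \tilde{\boldsymbol{\Sigma}}_{22}\end{bmatrix}$$

and predict

$$\tilde{x}_{11} = \tilde{\mu}_1 + \tilde{\boldsymbol{\Sigma}}_{12}\tilde{\boldsymbol{\Sigma}}_{22}^{-1}\begin{bmatrix}x_{12}-\tilde{\mu}_2 \\ x_{13}-\tilde{\mu}_3\end{bmatrix} = 6 + \left[\tfrac{1}{4},\ 1\right]\begin{bmatrix}\tfrac{1}{2} & \tfrac{3}{4} \\ \tfrac{3}{4} & \tfrac{5}{2}\end{bmatrix}^{-1}\begin{bmatrix}0-1 \\ 3-4\end{bmatrix} = 5.73$$

$$\widetilde{x_{11}^2} = \tilde{\sigma}_{11} - \tilde{\boldsymbol{\Sigma}}_{12}\tilde{\boldsymbol{\Sigma}}_{22}^{-1}\tilde{\boldsymbol{\Sigma}}_{21} + \tilde{x}_{11}^2 = \tfrac{1}{2} - \left[\tfrac{1}{4},\ 1\right]\begin{bmatrix}\tfrac{1}{2} & \tfrac{3}{4} \\ \tfrac{3}{4} & \tfrac{5}{2}\end{bmatrix}^{-1}\begin{bmatrix}\tfrac{1}{4} \\ 1\end{bmatrix} + (5.73)^2 = 32.99$$

$$\widetilde{x_{11}[x_{12}, x_{13}]} = \tilde{x}_{11}[x_{12}, x_{13}] = 5.73\,[0,\ 3] = [0,\ 17.18]$$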

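For the two missing components of $\mathbf{x}_4$, we partition $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\Sigma}}$ as

$$\tilde{\boldsymbol{\mu}} = \left[\begin{array}{c}\tilde{\mu}_1 \\ \tilde{\mu}_2 \\ \hline \tilde{\mu}_3\end{array}\right] = \begin{bmatrix}\tilde{\boldsymbol{\mu}}^{(1)} \\ \tilde{\boldsymbol{\mu}}^{(2)}\end{bmatrix}, \qquad \tilde{\boldsymbol{\Sigma}} = \left[\begin{array}{cc|c}\tilde{\sigma}_{11} & \tilde{\sigma}_{12} & \tilde{\sigma}_{13} \\ \tilde{\sigma}_{12} & \tilde{\sigma}_{22} & \tilde{\sigma}_{23} \\ \hline \tilde{\sigma}_{13} & \tilde{\sigma}_{23} & \tilde{\sigma}_{33}\end{array}\right] = \begin{bmatrix}\tilde{\boldsymbol{\Sigma}}_{11} & \tilde{\boldsymbol{\Sigma}}_{12} \\ \tilde{\boldsymbol{\Sigma}}_{21} & \tilde{\boldsymbol{\Sigma}}_{22}\end{bmatrix}$$

and predict

$$\begin{bmatrix}\tilde{x}_{41} \\ \tilde{x}_{42}\end{bmatrix} = E\left(\begin{bmatrix}X_{41} \\ X_{42}\end{bmatrix} \,\Big|\, x_{43} = 5;\ \tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}}\right) = \begin{bmatrix}\tilde{\mu}_1 \\ \tilde{\mu}_2\end{bmatrix} + \tilde{\boldsymbol{\Sigma}}_{12}\tilde{\boldsymbol{\Sigma}}_{22}^{-1}(x_{43} - \tilde{\mu}_3) = \begin{bmatrix}6 \\ 1\end{bmatrix} + \begin{bmatrix}1 \\ \tfrac{3}{4}\end{bmatrix}\left(\tfrac{5}{2}\right)^{-1}(5-4) = \begin{bmatrix}6.4 \\ 1.3\end{bmatrix}$$

for the contribution to $\mathbf{T}_1$. Also, from (5-39),

$$\begin{bmatrix}\widetilde{x_{41}^2} & \widetilde{x_{41}x_{42}} \\ \widetilde{x_{41}x_{42}} & \widetilde{x_{42}^2}\end{bmatrix} = E\left(\begin{bmatrix}X_{41}^2 & X_{41}X_{42} \\ X_{41}X_{42} & X_{42}^2\end{bmatrix} \,\Big|\, x_{43} = 5;\ \tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}}\right)$$

$$= \begin{bmatrix}\tfrac{1}{2} & \tfrac{1}{4} \\ \tfrac{1}{4} & \tfrac{1}{2}\end{bmatrix} - \begin{bmatrix}1 \\ \tfrac{3}{4}\end{bmatrix}\left(\tfrac{5}{2}\right)^{-1}\left[1,\ \tfrac{3}{4}\right] + \begin{bmatrix}6.4 \\ 1.3\end{bmatrix}[6.4,\ 1.3] = \begin{bmatrix}41.06 & 8.27 \\ 8.27 & 1.97\end{bmatrix}$$

and

$$\begin{bmatrix}\widetilde{x_{41}x_{43}} \\ \widetilde{x_{42}x_{43}}\end{bmatrix} = \begin{bmatrix}\tilde{x}_{41} \\ \tilde{x}_{42}\end{bmatrix}x_{43} = \begin{bmatrix}6.4 \\ 1.3\end{bmatrix}(5) = \begin{bmatrix}32.0 \\ 6.5\end{bmatrix}$$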




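are the contributions to $\mathbf{T}_2$. Thus, the predicted complete-data sufficient statistics are

$$\tilde{\mathbf{T}}_1 = \begin{bmatrix}\tilde{x}_{11}+x_{21}+x_{31}+\tilde{x}_{41} \\ x_{12}+x_{22}+x_{32}+\tilde{x}_{42} \\ x_{13}+x_{23}+x_{33}+x_{43}\end{bmatrix} = \begin{bmatrix}5.73+7+5+6.4 \\ 0+2+1+1.3 \\ 3+6+2+5\end{bmatrix} = \begin{bmatrix}24.13 \\ 4.30 \\ 16.00\end{bmatrix}$$

and, showing only the lower triangle of the symmetric matrix in the intermediate steps,

$$\tilde{\mathbf{T}}_2 = \begin{bmatrix}\widetilde{x_{11}^2}+x_{21}^2+x_{31}^2+\widetilde{x_{41}^2} & & \\ \widetilde{x_{11}x_{12}}+x_{21}x_{22}+x_{31}x_{32}+\widetilde{x_{41}x_{42}} & x_{12}^2+x_{22}^2+x_{32}^2+\widetilde{x_{42}^2} & \\ \widetilde{x_{11}x_{13}}+x_{21}x_{23}+x_{31}x_{33}+\widetilde{x_{41}x_{43}} & x_{12}x_{13}+x_{22}x_{23}+x_{32}x_{33}+\widetilde{x_{42}x_{43}} & x_{13}^2+x_{23}^2+x_{33}^2+x_{43}^2\end{bmatrix}$$

$$= \begin{bmatrix}32.99+7^2+5^2+41.06 & & \\ 0+7(2)+5(1)+8.27 & 0^2+2^2+1^2+1.97 & \\ 17.18+7(6)+5(2)+32 & 0(3)+2(6)+1(2)+6.5 & 3^2+6^2+2^2+5^2\end{bmatrix}$$

$$= \begin{bmatrix}148.05 & 27.27 & 101.18 \\ 27.27 & 6.97 & 20.50 \\ 101.18 & 20.50 & 74.00\end{bmatrix}$$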


This completes one prediction step. 
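The next estimation step, using (5-40), provides the revised estimates²

$$\tilde{\boldsymbol{\mu}} = \frac{1}{n}\tilde{\mathbf{T}}_1 = \frac{1}{4}\begin{bmatrix}24.13 \\ 4.30 \\ 16.00\end{bmatrix} = \begin{bmatrix}6.03 \\ 1.08 \\ 4.00\end{bmatrix}$$

$$\tilde{\boldsymbol{\Sigma}} = \frac{1}{n}\tilde{\mathbf{T}}_2 - \tilde{\boldsymbol{\mu}}\tilde{\boldsymbol{\mu}}' = \frac{1}{4}\begin{bmatrix}148.05 & 27.27 & 101.18 \\ 27.27 & 6.97 & 20.50 \\ 101.18 & 20.50 & 74.00\end{bmatrix} - \begin{bmatrix}6.03 \\ 1.08 \\ 4.00\end{bmatrix}[6.03\ \ 1.08\ \ 4.00] = \begin{bmatrix}.61 & .33 & 1.17 \\ .33 & .59 & .83 \\ 1.17 & .83 & 2.50\end{bmatrix}$$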


Note that $\tilde{\sigma}_{11} = .61$ and $\tilde{\sigma}_{22} = .59$ are larger than the corresponding initial estimates obtained by replacing the missing observations on the first and second variables by the sample means of the remaining values. The third variance estimate $\tilde{\sigma}_{33}$ remains unchanged, because it is not affected by the missing components.

The iteration between the prediction and estimation steps continues until the elements of $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\Sigma}}$ remain essentially unchanged. Calculations of this sort are easily handled with a computer. ■


² The final entries in $\tilde{\boldsymbol{\Sigma}}$ are exact to two decimal places.
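One cycle of the prediction-estimation algorithm for the data of Example 5.13 can be sketched in Python as follows. This is a minimal sketch using only the standard library; the function names (`solve`, `initial_estimates`, `em_step`) and the use of `None` to mark a missing component are choices made here for illustration, not notation from the text. Intermediate quantities are carried at full floating-point precision, so printed values may differ slightly from hand calculations that round at each stage.

```python
def solve(A, B):
    """Solve A X = B (A square, B a matrix) by Gauss-Jordan elimination."""
    k = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[k:] for row in M]


def initial_estimates(X):
    """Means of the available values; covariances (divisor n) after mean fill-in."""
    n, p = len(X), len(X[0])
    mu = []
    for i in range(p):
        vals = [r[i] for r in X if r[i] is not None]
        mu.append(sum(vals) / len(vals))
    F = [[mu[i] if r[i] is None else r[i] for i in range(p)] for r in X]
    Sigma = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in F) / n
              for j in range(p)] for i in range(p)]
    return mu, Sigma


def em_step(X, mu, Sigma):
    """One prediction step, (5-38) and (5-39), then one estimation step, (5-40)."""
    n, p = len(X), len(mu)
    T1 = [0.0] * p
    T2 = [[0.0] * p for _ in range(p)]
    for row in X:
        mis = [i for i in range(p) if row[i] is None]
        obs = [i for i in range(p) if row[i] is not None]
        x = list(row)
        C = [[0.0] * p for _ in range(p)]   # conditional covariance, missing block only
        S22 = [[Sigma[a][b] for b in obs] for a in obs]
        S21 = [[Sigma[a][b] for b in mis] for a in obs]
        B = solve(S22, S21)                 # B = Sigma22^{-1} Sigma21
        d = [row[a] - mu[a] for a in obs]
        for k, i in enumerate(mis):
            x[i] = mu[i] + sum(B[a][k] * d[a] for a in range(len(obs)))
            for l, j in enumerate(mis):
                C[i][j] = Sigma[i][j] - sum(Sigma[i][obs[a]] * B[a][l]
                                            for a in range(len(obs)))
        for i in range(p):
            T1[i] += x[i]
            for j in range(p):
                T2[i][j] += x[i] * x[j] + C[i][j]
    mu_new = [t / n for t in T1]
    Sigma_new = [[T2[i][j] / n - mu_new[i] * mu_new[j] for j in range(p)]
                 for i in range(p)]
    return mu_new, Sigma_new


# Incomplete data set of Example 5.13 (None marks a missing component).
X = [[None, 0, 3], [7, 2, 6], [5, 1, 2], [None, None, 5]]
mu0, Sigma0 = initial_estimates(X)
mu1, Sigma1 = em_step(X, mu0, Sigma0)
print([round(v, 2) for v in mu1])
print([[round(v, 2) for v in row] for row in Sigma1])
```

Calling `em_step` repeatedly, each time with the estimates it last returned, continues the iteration until the entries stop changing appreciably.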




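Once final estimates $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ are obtained and relatively few missing components occur in $\mathbf{X}$, it seems reasonable to treat

$$\text{all } \boldsymbol{\mu} \text{ such that } n(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu})'\hat{\boldsymbol{\Sigma}}^{-1}(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu}) \le \chi_p^2(\alpha) \tag{5-41}$$

as an approximate $100(1-\alpha)\%$ confidence ellipsoid. The simultaneous confidence statements would then follow as in Section 5.5, but with $\bar{\mathbf{x}}$ replaced by $\hat{\boldsymbol{\mu}}$ and $\mathbf{S}$ replaced by $\hat{\boldsymbol{\Sigma}}$.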


Caution. The prediction-estimation algorithm we discussed is developed on the basis that component observations are missing at random. If missing values are related to the response levels, then handling the missing values as suggested may introduce serious biases into the estimation procedures. Typically, missing values are related to the responses being measured. Consequently, we must be dubious of any computational scheme that fills in values as if they were lost at random. When more than a few values are missing, it is imperative that the investigator search for the systematic causes that created them.


5.8 Difficulties Due to Time Dependence in Multivariate 
Observations 


For the methods described in this chapter, we have assumed that the multivariate observations $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ constitute a random sample; that is, they are independent of one another. If the observations are collected over time, this assumption may not be valid. The presence of even a moderate amount of time dependence among the observations can cause serious difficulties for tests, confidence regions, and simultaneous confidence intervals, which are all constructed assuming that independence holds.

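We will illustrate the nature of the difficulty when the time dependence can be represented as a multivariate first order autoregressive [AR(1)] model. Let the $p \times 1$ random vector $\mathbf{X}_t$ follow the multivariate AR(1) model

$$\mathbf{X}_t - \boldsymbol{\mu} = \boldsymbol{\Phi}(\mathbf{X}_{t-1} - \boldsymbol{\mu}) + \boldsymbol{\varepsilon}_t \tag{5-42}$$

where the $\boldsymbol{\varepsilon}_t$ are independent and identically distributed with $E[\boldsymbol{\varepsilon}_t] = \mathbf{0}$ and $\operatorname{Cov}(\boldsymbol{\varepsilon}_t) = \boldsymbol{\Sigma}_{\varepsilon}$, and all of the eigenvalues of the coefficient matrix $\boldsymbol{\Phi}$ are between $-1$ and $1$. Under this model $\operatorname{Cov}(\mathbf{X}_t, \mathbf{X}_{t-r}) = \boldsymbol{\Phi}^r\boldsymbol{\Sigma}_X$, where

$$\boldsymbol{\Sigma}_X = \sum_{j=0}^{\infty}\boldsymbol{\Phi}^j\boldsymbol{\Sigma}_{\varepsilon}\boldsymbol{\Phi}'^{\,j}$$

The AR(1) model (5-42) relates the observation at time $t$ to the observation at time $t-1$ through the coefficient matrix $\boldsymbol{\Phi}$. Further, the autoregressive model says the observations are independent, under multivariate normality, if all the entries in the coefficient matrix $\boldsymbol{\Phi}$ are 0. The name autoregressive model comes from the fact that (5-42) looks like a multivariate version of a regression with $\mathbf{X}_t$ as the dependent variable and the previous value $\mathbf{X}_{t-1}$ as the independent variable.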




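As shown in Johnson and Langeland [8],

$$\bar{\mathbf{X}} \to \boldsymbol{\mu}, \qquad \mathbf{S} = \frac{1}{n-1}\sum_{t=1}^{n}(\mathbf{X}_t-\bar{\mathbf{X}})(\mathbf{X}_t-\bar{\mathbf{X}})' \to \boldsymbol{\Sigma}_X$$

where the arrow above indicates convergence in probability, and

$$\operatorname{Cov}\left(n^{-1/2}\sum_{t=1}^{n}\mathbf{X}_t\right) \to (\mathbf{I}-\boldsymbol{\Phi})^{-1}\boldsymbol{\Sigma}_X + \boldsymbol{\Sigma}_X(\mathbf{I}-\boldsymbol{\Phi}')^{-1} - \boldsymbol{\Sigma}_X \tag{5-43}$$

Moreover, for large $n$, $\sqrt{n}\,(\bar{\mathbf{X}}-\boldsymbol{\mu})$ is approximately normal with mean $\mathbf{0}$ and covariance matrix given by (5-43).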

To make the calculations easy, suppose the underlying process has $\boldsymbol{\Phi} = \phi\mathbf{I}$ where $|\phi| < 1$. Now consider the large sample nominal 95% confidence ellipsoid for $\boldsymbol{\mu}$.

$$\{\text{all } \boldsymbol{\mu} \text{ such that } n(\bar{\mathbf{X}}-\boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}}-\boldsymbol{\mu}) \le \chi_p^2(.05)\}$$

This ellipsoid has large sample coverage probability .95 if the observations are independent. If the observations are related by our autoregressive model, however, this ellipsoid has large sample coverage probability

$$P[\chi_p^2 \le (1-\phi)(1+\phi)^{-1}\chi_p^2(.05)]$$

Table 5.10 shows how the coverage probability is related to the coefficient $\phi$ and the number of variables $p$.

According to Table 5.10, the coverage probability can drop very low, to .632, even for the bivariate case.
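The coverage probability above can be evaluated numerically. The following Python sketch is one way to do so with only the standard library; the helper names (`chi2_cdf`, `chi2_upper_quantile`, `coverage`) are illustrative assumptions, and the chi-square distribution function is computed here from a series for the incomplete gamma function rather than taken from a statistics package.

```python
import math


def chi2_cdf(x, df):
    """P[chi-square with df degrees of freedom <= x], via the series for the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_upper_quantile(alpha, df):
    """Upper (100 alpha)th percentile of chi-square(df), found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < 1.0 - alpha:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def coverage(phi, p, alpha=0.05):
    """Large-sample coverage of the nominal ellipsoid when Phi = phi * I."""
    crit = chi2_upper_quantile(alpha, p)
    return chi2_cdf((1.0 - phi) / (1.0 + phi) * crit, p)


print(round(coverage(0.5, 2), 3))   # about 0.632, the bivariate value quoted in the text
print(round(coverage(0.0, 2), 3))   # independence recovers the nominal 0.95
```

Positive values of $\phi$ shrink the argument of the distribution function and so lower the coverage; negative values raise it above the nominal level.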

The independence assumption is crucial, and the results based on this assump- 
tion can be very misleading if the observations are, in fact, dependent. 


Table 5.10 Coverage Probability of the Nominal 95% Confidence 
Ellipsoid 


Supplement 


SIMULTANEOUS CONFIDENCE 
INTERVALS AND ELLIPSES AS SHADOWS 
OF THE p-DIMENSIONAL ELLIPSOIDS 


We begin this supplementary section by establishing the general result concerning 
the projection (shadow) of an ellipsoid onto a line. 


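Result 5A.1. Let the constant $c > 0$ and positive definite $p \times p$ matrix $\mathbf{A}$ determine the ellipsoid $\{\mathbf{z}: \mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2\}$. For a given vector $\mathbf{u} \ne \mathbf{0}$, and $\mathbf{z}$ belonging to the ellipsoid, the

$$\left(\begin{array}{c}\text{Projection (shadow) of}\\ \{\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2\} \text{ on } \mathbf{u}\end{array}\right) = c\,\frac{\sqrt{\mathbf{u}'\mathbf{A}\mathbf{u}}}{\mathbf{u}'\mathbf{u}}\,\mathbf{u}$$

which extends from $\mathbf{0}$ along $\mathbf{u}$ with length $c\sqrt{\mathbf{u}'\mathbf{A}\mathbf{u}/\mathbf{u}'\mathbf{u}}$. When $\mathbf{u}$ is a unit vector, the shadow extends $c\sqrt{\mathbf{u}'\mathbf{A}\mathbf{u}}$ units, so $|\mathbf{z}'\mathbf{u}| \le c\sqrt{\mathbf{u}'\mathbf{A}\mathbf{u}}$. The shadow also extends $c\sqrt{\mathbf{u}'\mathbf{A}\mathbf{u}}$ units in the $-\mathbf{u}$ direction.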


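Proof. By Definition 2A.12, the projection of any $\mathbf{z}$ on $\mathbf{u}$ is given by $(\mathbf{z}'\mathbf{u})\mathbf{u}/\mathbf{u}'\mathbf{u}$. Its squared length is $(\mathbf{z}'\mathbf{u})^2/\mathbf{u}'\mathbf{u}$. We want to maximize this shadow over all $\mathbf{z}$ with $\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2$. The extended Cauchy-Schwarz inequality in (2-49) states that $(\mathbf{b}'\mathbf{d})^2 \le (\mathbf{b}'\mathbf{B}\mathbf{b})(\mathbf{d}'\mathbf{B}^{-1}\mathbf{d})$, with equality when $\mathbf{b} = k\mathbf{B}^{-1}\mathbf{d}$. Setting $\mathbf{b} = \mathbf{z}$, $\mathbf{d} = \mathbf{u}$, and $\mathbf{B} = \mathbf{A}^{-1}$, we obtain

$$(\mathbf{u}'\mathbf{u})(\text{length of projection})^2 = (\mathbf{z}'\mathbf{u})^2 \le (\mathbf{z}'\mathbf{A}^{-1}\mathbf{z})(\mathbf{u}'\mathbf{A}\mathbf{u}) \le c^2\,\mathbf{u}'\mathbf{A}\mathbf{u} \quad \text{for all } \mathbf{z}: \mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2$$

The choice $\mathbf{z} = c\mathbf{A}\mathbf{u}/\sqrt{\mathbf{u}'\mathbf{A}\mathbf{u}}$ yields equalities and thus gives the maximum shadow, besides belonging to the boundary of the ellipsoid. That is, $\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} = c^2\mathbf{u}'\mathbf{A}\mathbf{u}/\mathbf{u}'\mathbf{A}\mathbf{u} = c^2$ for this $\mathbf{z}$ that provides the longest shadow. Consequently, the projection of the ellipsoid on $\mathbf{u}$ is $c\sqrt{\mathbf{u}'\mathbf{A}\mathbf{u}}\,\mathbf{u}/\mathbf{u}'\mathbf{u}$, and its length is $c\sqrt{\mathbf{u}'\mathbf{A}\mathbf{u}/\mathbf{u}'\mathbf{u}}$. With the unit vector $\mathbf{e}_u = \mathbf{u}/\sqrt{\mathbf{u}'\mathbf{u}}$, the projection extends

$$\sqrt{c^2\,\mathbf{e}_u'\mathbf{A}\mathbf{e}_u} = \frac{c}{\sqrt{\mathbf{u}'\mathbf{u}}}\sqrt{\mathbf{u}'\mathbf{A}\mathbf{u}} \quad \text{units along } \mathbf{u}$$

The projection of the ellipsoid also extends the same length in the direction $-\mathbf{u}$. ■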
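The maximizing choice of $\mathbf{z}$ in the proof can be checked numerically. The Python sketch below uses a hypothetical $2 \times 2$ positive definite matrix, a hypothetical constant $c$, and a hypothetical unit vector $\mathbf{u}$; it confirms that $\mathbf{z} = c\mathbf{A}\mathbf{u}/\sqrt{\mathbf{u}'\mathbf{A}\mathbf{u}}$ lies on the boundary of the ellipsoid and that no other boundary point tried has a longer shadow on $\mathbf{u}$.

```python
import math


def matvec(A, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


A = [[4.0, 1.0], [1.0, 2.0]]                          # hypothetical positive definite matrix
A_inv = [[2.0 / 7, -1.0 / 7], [-1.0 / 7, 4.0 / 7]]    # its inverse (det = 7)
c = 1.5
u = [0.6, 0.8]                                        # unit vector

Au = matvec(A, u)
uAu = dot(u, Au)
z_star = [c * a / math.sqrt(uAu) for a in Au]         # z = c A u / sqrt(u'Au)

boundary = dot(z_star, matvec(A_inv, z_star))         # should equal c^2
shadow = dot(z_star, u)                               # should equal c sqrt(u'Au)

# Scan the boundary z = c L w, |w| = 1, where A = L L' (Cholesky factor of A).
L = [[2.0, 0.0], [0.5, math.sqrt(1.75)]]
max_proj = 0.0
for k in range(3600):
    t = 2.0 * math.pi * k / 3600
    z = [c * v for v in matvec(L, [math.cos(t), math.sin(t)])]
    max_proj = max(max_proj, abs(dot(z, u)))

print(boundary, c * c)
print(shadow, c * math.sqrt(uAu), max_proj)
```

The scan over boundary points is only a numerical check on a grid of angles, not a proof; the inequality itself follows from the extended Cauchy-Schwarz argument.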


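Result 5A.2. Suppose that the ellipsoid $\{\mathbf{z}: \mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2\}$ is given and that $\mathbf{U} = [\mathbf{u}_1 \mid \mathbf{u}_2]$ is arbitrary but of rank two. Then

$$\left\{\begin{array}{c}\mathbf{z} \text{ in the ellipsoid}\\ \text{based on } \mathbf{A}^{-1} \text{ and } c^2\end{array}\right\} \text{ implies that } \left\{\begin{array}{c}\text{for all } \mathbf{U},\ \mathbf{U}'\mathbf{z} \text{ is in the ellipsoid}\\ \text{based on } (\mathbf{U}'\mathbf{A}\mathbf{U})^{-1} \text{ and } c^2\end{array}\right\}$$

or

$$\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2 \quad \text{implies that} \quad (\mathbf{U}'\mathbf{z})'(\mathbf{U}'\mathbf{A}\mathbf{U})^{-1}(\mathbf{U}'\mathbf{z}) \le c^2 \quad \text{for all } \mathbf{U}$$

Proof. We first establish a basic inequality. Set $\mathbf{P} = \mathbf{A}^{1/2}\mathbf{U}(\mathbf{U}'\mathbf{A}\mathbf{U})^{-1}\mathbf{U}'\mathbf{A}^{1/2}$, where $\mathbf{A} = \mathbf{A}^{1/2}\mathbf{A}^{1/2}$. Note that $\mathbf{P} = \mathbf{P}'$ and $\mathbf{P}^2 = \mathbf{P}$, so $(\mathbf{I}-\mathbf{P})\mathbf{P}' = \mathbf{P} - \mathbf{P}^2 = \mathbf{0}$. Next, using $\mathbf{A}^{-1} = \mathbf{A}^{-1/2}\mathbf{A}^{-1/2}$, we write $\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} = (\mathbf{A}^{-1/2}\mathbf{z})'(\mathbf{A}^{-1/2}\mathbf{z})$ and $\mathbf{A}^{-1/2}\mathbf{z} = \mathbf{P}\mathbf{A}^{-1/2}\mathbf{z} + (\mathbf{I}-\mathbf{P})\mathbf{A}^{-1/2}\mathbf{z}$. Then

$$\begin{aligned}\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} &= (\mathbf{A}^{-1/2}\mathbf{z})'(\mathbf{A}^{-1/2}\mathbf{z}) \\ &= (\mathbf{P}\mathbf{A}^{-1/2}\mathbf{z} + (\mathbf{I}-\mathbf{P})\mathbf{A}^{-1/2}\mathbf{z})'(\mathbf{P}\mathbf{A}^{-1/2}\mathbf{z} + (\mathbf{I}-\mathbf{P})\mathbf{A}^{-1/2}\mathbf{z}) \\ &= (\mathbf{P}\mathbf{A}^{-1/2}\mathbf{z})'(\mathbf{P}\mathbf{A}^{-1/2}\mathbf{z}) + ((\mathbf{I}-\mathbf{P})\mathbf{A}^{-1/2}\mathbf{z})'((\mathbf{I}-\mathbf{P})\mathbf{A}^{-1/2}\mathbf{z}) \\ &\ge \mathbf{z}'\mathbf{A}^{-1/2}\mathbf{P}'\mathbf{P}\mathbf{A}^{-1/2}\mathbf{z} = \mathbf{z}'\mathbf{A}^{-1/2}\mathbf{P}\mathbf{A}^{-1/2}\mathbf{z} = \mathbf{z}'\mathbf{U}(\mathbf{U}'\mathbf{A}\mathbf{U})^{-1}\mathbf{U}'\mathbf{z}\end{aligned} \tag{5A-1}$$

Since $\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2$ and $\mathbf{U}$ was arbitrary, the result follows. ■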


Our next result establishes the two-dimensional confidence ellipse as a projection 
of the p-dimensional ellipsoid. (See Figure 5.13.) 


Figure 5.13 The shadow of the ellipsoid $\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2$ on the $\mathbf{u}_1, \mathbf{u}_2$ plane is an ellipse.




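Projection on a plane is simplest when the two vectors $\mathbf{u}_1$ and $\mathbf{u}_2$ determining the plane are first converted to perpendicular vectors of unit length. (See Result 2A.3.)


Result 5A.3. Given the ellipsoid $\{\mathbf{z}: \mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2\}$ and two perpendicular unit vectors $\mathbf{u}_1$ and $\mathbf{u}_2$, the projection (or shadow) of $\{\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2\}$ on the $\mathbf{u}_1, \mathbf{u}_2$ plane results in the two-dimensional ellipse $\{(\mathbf{U}'\mathbf{z})'(\mathbf{U}'\mathbf{A}\mathbf{U})^{-1}(\mathbf{U}'\mathbf{z}) \le c^2\}$, where $\mathbf{U} = [\mathbf{u}_1 \mid \mathbf{u}_2]$.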


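Proof. By Result 2A.3, the projection of a vector $\mathbf{z}$ on the $\mathbf{u}_1, \mathbf{u}_2$ plane is

$$(\mathbf{u}_1'\mathbf{z})\mathbf{u}_1 + (\mathbf{u}_2'\mathbf{z})\mathbf{u}_2 = [\mathbf{u}_1 \mid \mathbf{u}_2]\begin{bmatrix}\mathbf{u}_1'\mathbf{z} \\ \mathbf{u}_2'\mathbf{z}\end{bmatrix} = \mathbf{U}\mathbf{U}'\mathbf{z}$$

The projection of the ellipsoid $\{\mathbf{z}: \mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2\}$ consists of all $\mathbf{U}\mathbf{U}'\mathbf{z}$ with $\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2$. Consider the two coordinates $\mathbf{U}'\mathbf{z}$ of the projection $\mathbf{U}(\mathbf{U}'\mathbf{z})$. Let $\mathbf{z}$ belong to the set $\{\mathbf{z}: \mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2\}$ so that $\mathbf{U}\mathbf{U}'\mathbf{z}$ belongs to the shadow of the ellipsoid. By Result 5A.2,

$$(\mathbf{U}'\mathbf{z})'(\mathbf{U}'\mathbf{A}\mathbf{U})^{-1}(\mathbf{U}'\mathbf{z}) \le c^2$$

so the ellipse $\{(\mathbf{U}'\mathbf{z})'(\mathbf{U}'\mathbf{A}\mathbf{U})^{-1}(\mathbf{U}'\mathbf{z}) \le c^2\}$ contains the coefficient vectors for the shadow of the ellipsoid.

Let $\mathbf{U}\mathbf{a}$ be a vector in the $\mathbf{u}_1, \mathbf{u}_2$ plane whose coefficients $\mathbf{a}$ belong to the ellipse $\{\mathbf{a}'(\mathbf{U}'\mathbf{A}\mathbf{U})^{-1}\mathbf{a} \le c^2\}$. If we set $\mathbf{z} = \mathbf{A}\mathbf{U}(\mathbf{U}'\mathbf{A}\mathbf{U})^{-1}\mathbf{a}$, it follows that

$$\mathbf{U}'\mathbf{z} = \mathbf{U}'\mathbf{A}\mathbf{U}(\mathbf{U}'\mathbf{A}\mathbf{U})^{-1}\mathbf{a} = \mathbf{a}$$

and

$$\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} = \mathbf{a}'(\mathbf{U}'\mathbf{A}\mathbf{U})^{-1}\mathbf{U}'\mathbf{A}\mathbf{A}^{-1}\mathbf{A}\mathbf{U}(\mathbf{U}'\mathbf{A}\mathbf{U})^{-1}\mathbf{a} = \mathbf{a}'(\mathbf{U}'\mathbf{A}\mathbf{U})^{-1}\mathbf{a} \le c^2$$

Thus, $\mathbf{U}'\mathbf{z}$ belongs to the coefficient vector ellipse, and $\mathbf{z}$ belongs to the ellipsoid $\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2$. Consequently, the ellipse contains only coefficient vectors from the projection of $\{\mathbf{z}: \mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2\}$ onto the $\mathbf{u}_1, \mathbf{u}_2$ plane. ■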


Remark. Projecting the ellipsoid $\mathbf{z}'\mathbf{A}^{-1}\mathbf{z} \le c^2$ first to the $\mathbf{u}_1, \mathbf{u}_2$ plane and then to the line $\mathbf{u}_1$ is the same as projecting it directly to the line determined by $\mathbf{u}_1$. In the context of confidence ellipsoids, the shadows of the two-dimensional ellipses give the single component intervals.


Remark. Results 5A.2 and 5A.3 remain valid if $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_q]$ consists of $2 < q \le p$ linearly independent columns.


Exercises 


5.1. (a) Evaluate $T^2$, for testing $H_0: \boldsymbol{\mu}' = [7, 11]$, using the data

$$\mathbf{X} = \begin{bmatrix}2 & 12 \\ 8 & 9 \\ 6 & 9 \\ 8 & 10\end{bmatrix}$$

(b) Specify the distribution of $T^2$ for the situation in (a).
(c) Using (a) and (b), test $H_0$ at the $\alpha = .05$ level. What conclusion do you reach?


5.2. Using the data in Example 5.1, verify that $T^2$ remains unchanged if each observation $\mathbf{x}_j$, $j = 1, 2, 3$, is replaced by $\mathbf{C}\mathbf{x}_j$, where

$$\mathbf{C} = \begin{bmatrix}1 & -1 \\ 1 & 1\end{bmatrix}$$

Note that the observations

$$\mathbf{C}\mathbf{x}_j = \begin{bmatrix}x_{j1}-x_{j2} \\ x_{j1}+x_{j2}\end{bmatrix}$$

yield the data matrix

$$\begin{bmatrix}(6-9) & (10-6) & (8-3) \\ (6+9) & (10+6) & (8+3)\end{bmatrix}'$$


5.3. (a) Use expression (5-15) to evaluate $T^2$ for the data in Exercise 5.1.
(b) Use the data in Exercise 5.1 to evaluate $\Lambda$ in (5-13). Also, evaluate Wilks' lambda.


5.4. Use the sweat data in Table 5.1. (See Example 5.2.)

(a) Determine the axes of the 90% confidence ellipsoid for $\boldsymbol{\mu}$. Determine the lengths of these axes.

(b) Construct Q-Q plots for the observations on sweat rate, sodium content, and potassium content, respectively. Construct the three possible scatter plots for pairs of observations. Does the multivariate normal assumption seem justified in this case? Comment.


5.5. The quantities $\bar{\mathbf{x}}$, $\mathbf{S}$, and $\mathbf{S}^{-1}$ are given in Example 5.3 for the transformed microwave-radiation data. Conduct a test of the null hypothesis $H_0: \boldsymbol{\mu}' = [.55, .60]$ at the $\alpha = .05$ level of significance. Is your result consistent with the 95% confidence ellipse for $\boldsymbol{\mu}$ pictured in Figure 5.1? Explain.


5.6. Verify the Bonferroni inequality in (5-28) for $m = 3$.
Hint: A Venn diagram for the three events $C_1$, $C_2$, and $C_3$ may help.


5.7. Use the sweat data in Table 5.1. (See Example 5.2.) Find simultaneous 95% $T^2$ confidence intervals for $\mu_1$, $\mu_2$, and $\mu_3$ using Result 5.3. Construct the 95% Bonferroni intervals using (5-29). Compare the two sets of intervals.


5.8. From (5-23), we know that $T^2$ is equal to the largest squared univariate $t$-value constructed from the linear combination $\mathbf{a}'\mathbf{x}_j$ with $\mathbf{a} = \mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)$. Using the results in Example 5.3 and the $H_0$ in Exercise 5.5, evaluate $\mathbf{a}$ for the transformed microwave-radiation data. Verify that the $t^2$-value computed with this $\mathbf{a}$ is equal to $T^2$ in Exercise 5.5.


5.9. Harry Roberts, a naturalist for the Alaska Fish and Game department, studies grizzly bears with the goal of maintaining a healthy population. Measurements on $n = 61$ bears provided the following summary statistics (see also Exercise 8.23):

Variable        Weight   Body length   Neck    Girth   Head length   Head width
                (kg)     (cm)          (cm)    (cm)    (cm)          (cm)
Sample mean     95.52    164.38        55.69   93.39   17.98         31.13

Covariance matrix

$$\mathbf{S} = \begin{bmatrix}3266.46 & 1343.97 & 731.54 & 1175.50 & 162.68 & 238.37 \\ 1343.97 & 721.91 & 324.25 & 537.35 & 80.17 & 117.73 \\ 731.54 & 324.25 & 179.28 & 281.17 & 39.15 & 56.80 \\ 1175.50 & 537.35 & 281.17 & 474.98 & 63.73 & 94.85 \\ 162.68 & 80.17 & 39.15 & 63.73 & 9.95 & 13.88 \\ 238.37 & 117.73 & 56.80 & 94.85 & 13.88 & 21.26\end{bmatrix}$$

(a) Obtain the large sample 95% simultaneous confidence intervals for the six population mean body measurements.

(b) Obtain the large sample 95% simultaneous confidence ellipse for mean weight and mean girth.

(c) Obtain the 95% Bonferroni confidence intervals for the six means in Part a.

(d) Refer to Part b. Construct the 95% Bonferroni confidence rectangle for the mean weight and mean girth using $m = 6$. Compare this rectangle with the confidence ellipse in Part b.

(e) Obtain the 95% Bonferroni confidence interval for

mean head width − mean head length

using $m = 6 + 1 = 7$ to allow for this statement as well as statements about each individual mean.


5.10. Refer to the bear growth data in Example 1.10 (see Table 1.4). Restrict your attention to the measurements of length.

(a) Obtain the 95% $T^2$ simultaneous confidence intervals for the four population means for length.

(b) Refer to Part a. Obtain the 95% $T^2$ simultaneous confidence intervals for the three successive yearly increases in mean length.

(c) Obtain the 95% $T^2$ confidence ellipse for the mean increase in length from 2 to 3 years and the mean increase in length from 4 to 5 years.

(d) Refer to Parts a and b. Construct the 95% Bonferroni confidence intervals for the set consisting of four mean lengths and three successive yearly increases in mean length.

(e) Refer to Parts c and d. Compare the 95% Bonferroni confidence rectangle for the mean increase in length from 2 to 3 years and the mean increase in length from 4 to 5 years with the confidence ellipse produced by the $T^2$-procedure.


5.11. A physical anthropologist performed a mineral analysis of nine ancient Peruvian hairs. The results for the chromium ($x_1$) and strontium ($x_2$) levels, in parts per million (ppm), were as follows:

x1 (Cr)     .48   40.53    2.19     .55     .74    .66    .93    .37    .22
x2 (St)   12.57   73.68   11.13   20.03   20.29    .78   4.64    .43   1.08

Source: Benfer and others, "Mineral Analysis of Ancient Peruvian Hair," American Journal of Physical Anthropology, 48, no. 3 (1978), 277-282.

It is known that low levels (less than or equal to .100 ppm) of chromium suggest the presence of diabetes, while strontium is an indication of animal protein intake.

(a) Construct and plot a 90% joint confidence ellipse for the population mean vector $\boldsymbol{\mu}' = [\mu_1, \mu_2]$, assuming that these nine Peruvian hairs represent a random sample from individuals belonging to a particular ancient Peruvian culture.

(b) Obtain the individual simultaneous 90% confidence intervals for $\mu_1$ and $\mu_2$ by "projecting" the ellipse constructed in Part a on each coordinate axis. (Alternatively, we could use Result 5.3.) Does it appear as if this Peruvian culture has a mean strontium level of 10? That is, are any of the points ($\mu_1$ arbitrary, 10) in the confidence regions? Is $[.30, 10]'$ a plausible value for $\boldsymbol{\mu}$? Discuss.

(c) Do these data appear to be bivariate normal? Discuss their status with reference to Q-Q plots and a scatter diagram. If the data are not bivariate normal, what implications does this have for the results in Parts a and b?

(d) Repeat the analysis with the obvious "outlying" observation removed. Do the inferences change? Comment.


5.12. Given the data 


with missing components, use the prediction-estimation algorithm of Section 5.7 to estimate $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. Determine the initial estimates, and iterate to find the first revised estimates.


5.13. Determine the approximate distribution of $-n\ln(|\hat{\boldsymbol{\Sigma}}|/|\hat{\boldsymbol{\Sigma}}_0|)$ for the sweat data in Table 5.1. (See Result 5.2.)


5.14. Create a table similar to Table 5.4 using the entries (length of one-at-a-time $t$-interval)/(length of Bonferroni $t$-interval).



Exercises 5.15, 5.16, and 5.17 refer to the following information: 


Frequently, some or all of the population characteristics of interest are in the form of attributes. Each individual in the population may then be described in terms of the attributes it possesses. For convenience, attributes are usually numerically coded with respect to their presence or absence. If we let the variable $X$ pertain to a specific attribute, then we can distinguish between the presence or absence of this attribute by defining

$$X = \begin{cases}1 & \text{if attribute present} \\ 0 & \text{if attribute absent}\end{cases}$$


In this way, we can assign numerical values to qualitative characteristics. 

When attributes are numerically coded as 0-1 variables, a random sample from the 
population of interest results in statistics that consist of the counts of the number of 
sample items that have each distinct set of characteristics. If the sample counts are 
large, methods for producing simultaneous confidence statements can be easily adapted 
to situations involving proportions. 

We consider the situation where an individual with a particular combination of attributes can be classified into one of $q + 1$ mutually exclusive and exhaustive categories. The corresponding probabilities are denoted by $p_1, p_2, \ldots, p_q, p_{q+1}$. Since the categories include all possibilities, we take $p_{q+1} = 1 - (p_1 + p_2 + \cdots + p_q)$. An individual from category $k$ will be assigned the $((q+1) \times 1)$ vector value $[0, \ldots, 0, 1, 0, \ldots, 0]'$ with 1 in the $k$th position.

The probability distribution for an observation from the population of individuals in $q + 1$ mutually exclusive and exhaustive categories is known as the multinomial distribution. It has the following structure, where the outcome for category $k$ is the $(q+1) \times 1$ vector with 1 in the $k$th position and 0 elsewhere:

Category                    1      2     ...    k     ...    q      q+1
Probability (proportion)    p_1    p_2   ...    p_k   ...    p_q    p_{q+1} = 1 - (p_1 + ... + p_q)

Let $\mathbf{X}_j$, $j = 1, 2, \ldots, n$, be a random sample of size $n$ from the multinomial distribution.

The $k$th component, $X_{jk}$, of $\mathbf{X}_j$ is 1 if the observation (individual) is from category $k$ and is 0 otherwise. The random sample $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ can be converted to a sample proportion vector, which, given the nature of the preceding observations, is a sample mean vector. Thus,

$$\hat{\mathbf{p}} = \begin{bmatrix}\hat{p}_1 \\ \hat{p}_2 \\ \vdots \\ \hat{p}_{q+1}\end{bmatrix} = \frac{1}{n}\sum_{j=1}^{n}\mathbf{X}_j \quad \text{with} \quad E(\hat{\mathbf{p}}) = \mathbf{p} = \begin{bmatrix}p_1 \\ p_2 \\ \vdots \\ p_{q+1}\end{bmatrix}$$


and

$$\operatorname{Cov}(\hat{\mathbf{p}}) = \frac{1}{n}\operatorname{Cov}(\mathbf{X}_j) = \frac{1}{n}\boldsymbol{\Sigma} = \frac{1}{n}\begin{bmatrix}\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1,q+1} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2,q+1} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1,q+1} & \sigma_{2,q+1} & \cdots & \sigma_{q+1,q+1}\end{bmatrix}$$

For large $n$, the approximate sampling distribution of $\hat{\mathbf{p}}$ is provided by the central limit theorem. We have

$$\sqrt{n}\,(\hat{\mathbf{p}} - \mathbf{p}) \text{ is approximately } N(\mathbf{0}, \boldsymbol{\Sigma})$$

where the elements of $\boldsymbol{\Sigma}$ are $\sigma_{kk} = p_k(1 - p_k)$ and $\sigma_{ik} = -p_ip_k$. The normal approximation remains valid when $\sigma_{kk}$ is estimated by $\hat{\sigma}_{kk} = \hat{p}_k(1 - \hat{p}_k)$ and $\sigma_{ik}$ is estimated by $\hat{\sigma}_{ik} = -\hat{p}_i\hat{p}_k$, $i \ne k$.

Since each individual must belong to exactly one category, $X_{q+1,j} = 1 - (X_{1j} + X_{2j} + \cdots + X_{qj})$, so $\hat{p}_{q+1} = 1 - (\hat{p}_1 + \hat{p}_2 + \cdots + \hat{p}_q)$, and as a result, $\hat{\boldsymbol{\Sigma}}$ has rank $q$. The usual inverse of $\hat{\boldsymbol{\Sigma}}$ does not exist, but it is still possible to develop simultaneous $100(1-\alpha)\%$ confidence intervals for all linear combinations $\mathbf{a}'\mathbf{p}$.


Result. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from a $q + 1$ category multinomial distribution with $P[X_{jk} = 1] = p_k$, $k = 1, 2, \ldots, q+1$, $j = 1, 2, \ldots, n$. Approximate simultaneous $100(1-\alpha)\%$ confidence regions for all linear combinations $\mathbf{a}'\mathbf{p} = a_1p_1 + a_2p_2 + \cdots + a_{q+1}p_{q+1}$ are given by the observed values of

$$\mathbf{a}'\hat{\mathbf{p}} \pm \sqrt{\chi_q^2(\alpha)}\sqrt{\frac{\mathbf{a}'\hat{\boldsymbol{\Sigma}}\mathbf{a}}{n}}$$

provided that $n - q$ is large. Here $\hat{\mathbf{p}} = (1/n)\sum_{j=1}^{n}\mathbf{X}_j$, and $\hat{\boldsymbol{\Sigma}} = \{\hat{\sigma}_{ik}\}$ is a $(q+1) \times (q+1)$ matrix with $\hat{\sigma}_{kk} = \hat{p}_k(1 - \hat{p}_k)$ and $\hat{\sigma}_{ik} = -\hat{p}_i\hat{p}_k$, $i \ne k$. Also, $\chi_q^2(\alpha)$ is the upper $(100\alpha)$th percentile of the chi-square distribution with $q$ d.f. ■


In this result, the requirement that $n - q$ is large is interpreted to mean $n\hat{p}_k$ is about 20 or more for each category.

We have only touched on the possibilities for the analysis of categorical data. Complete discussions of categorical data analysis are available in [1] and [4].


5.15. Let $X_{ji}$ and $X_{jk}$ be the $i$th and $k$th components, respectively, of $\mathbf{X}_j$.

(a) Show that $\mu_i = E(X_{ji}) = p_i$ and $\sigma_{ii} = \operatorname{Var}(X_{ji}) = p_i(1 - p_i)$, $i = 1, 2, \ldots, p$.

(b) Show that $\sigma_{ik} = \operatorname{Cov}(X_{ji}, X_{jk}) = -p_ip_k$, $i \ne k$. Why must this covariance necessarily be negative?


5.16. As part of a larger marketing research project, a consultant for the Bank of Shorewood wants to know the proportion of savers that uses the bank's facilities as their primary vehicle for saving. The consultant would also like to know the proportions of savers who use the three major competitors: Bank B, Bank C, and Bank D. Each individual contacted in a survey responded to the following question:


Which bank is your primary savings bank?

Response:   Bank of Shorewood   Bank B   Bank C   Bank D   Another Bank   No Savings

A sample of $n = 355$ people with savings accounts produced the following counts when asked to indicate their primary savings banks (the people with no savings will be ignored in the comparison of savers, so there are five categories):

Bank (category)              Bank of Shorewood   Bank B   Bank C   Bank D   Another bank
Observed number              105                 119      56       25       50              Total n = 355
Population proportion        p_1                 p_2      p_3      p_4      p_5 = 1 - (p_1 + p_2 + p_3 + p_4)
Observed sample proportion   105/355 = .30       .33      .16      .07      .14

Let the population proportions be

$p_1$ = proportion of savers at Bank of Shorewood
$p_2$ = proportion of savers at Bank B
$p_3$ = proportion of savers at Bank C
$p_4$ = proportion of savers at Bank D
$1 - (p_1 + p_2 + p_3 + p_4)$ = proportion of savers at other banks

(a) Construct simultaneous 95% confidence intervals for $p_1, p_2, \ldots, p_5$.

(b) Construct a simultaneous 95% confidence interval that allows a comparison of the Bank of Shorewood with its major competitor, Bank B. Interpret this interval.


5.17. In order to assess the prevalence of a drug problem among high school students in a
particular city, a random sample of 200 students from the city's five high schools
was surveyed. One of the survey questions and the corresponding responses are
as follows:

What is your typical weekly marijuana usage?

Category               None    Moderate (1-3 joints)    Heavy (4 or more joints)
Number of responses    117              62                        21

Construct 95% simultaneous confidence intervals for the three proportions p1, p2, and
p3 = 1 - (p1 + p2).


The following exercises may require a computer. 


5.18. Use the college test data in Table 5.2. (See Example 5.5.) 


(a) Test the null hypothesis $H_0\colon \mu' = [500, 50, 30]$ versus $H_1\colon \mu' \neq [500, 50, 30]$ at the
$\alpha = .05$ level of significance. Suppose $[500, 50, 30]'$ represent average scores for
thousands of college students over the last 10 years. Is there reason to believe that the
group of students represented by the scores in Table 5.2 is scoring differently?
Explain.

(b) Determine the lengths and directions for the axes of the 95% confidence ellipsoid for $\mu$.


(c) Construct Q-Q plots from the marginal distributions of social science and history, 
verbal, and science scores. Also, construct the three possible scatter diagrams from 
the pairs of observations on different variables. Do these data appear to be normally 
distributed? Discuss. 


5.19. Measurements of $x_1$ = stiffness and $x_2$ = bending strength for a sample of n = 30 pieces
of a particular grade of lumber are given in Table 5.11. The units are pounds/(inches)$^2$.
Using the data in the table,

Table 5.11 Lumber Data

x1 (Stiffness:            x2                    x1 (Stiffness:            x2
modulus of elasticity)    (Bending strength)    modulus of elasticity)    (Bending strength)
        1232                    4175                    1712                    7749
        1115                    6652                    1932                    6818
        2205                    7612                    1820                    9307
        1897                  10,914                    1900                    6457
        1932                  10,850                    2426                  10,102
        1612                    7627                    1558                    7414
        1598                    6954                    1470                    7556
        1804                    8365                    1858                    7833
        1752                    9469                    1587                    8309
        2067                    6410                    2208                    9559
        2365                  10,327                    1487                    6255
        1646                    7320                    2206                  10,723
        1579                    8196                    2332                    5430
        1880                    9709                    2540                  12,090
        1773                  10,370                    2322                  10,072
Source: Data courtesy of U.S. Forest Products Laboratory.


(a) Construct and sketch a 95% confidence ellipse for the pair $[\mu_1, \mu_2]'$, where
$\mu_1 = E(X_1)$ and $\mu_2 = E(X_2)$.

(b) Suppose $\mu_{10} = 2000$ and $\mu_{20} = 10{,}000$ represent "typical" values for stiffness and
bending strength, respectively. Given the result in (a), are the data in Table 5.11
consistent with these values? Explain.

(c) Is the bivariate normal distribution a viable population model? Explain with
reference to Q-Q plots and a scatter diagram.

5.20. A wildlife ecologist measured $x_1$ = tail length (in millimeters) and $x_2$ = wing length (in
millimeters) for a sample of n = 45 female hook-billed kites. These data are displayed in
Table 5.12. Using the data in the table,


Table 5.12 Bird Data 


x1 = tail length, x2 = wing length; each row lists successive (x1, x2) pairs

186 266 173 271
197 285 194 220) 
201 295 198 300 
190 282 180 272 
209 305 190 292 
187 285 191 286 
207 297 196 285 
178 268 207 286 
202 271 209 303 
205 285 179 261 
190 280 186 262 
189 277 174 245 
21h 310 181 250 
216 305 189 262 
188 
Source: Data courtesy of S. Temple. 


(a) Find and sketch the 95% confidence ellipse for the population means $\mu_1$ and $\mu_2$.
Suppose it is known that $\mu_1 = 190$ mm and $\mu_2 = 275$ mm for male hook-billed
kites. Are these plausible values for the mean tail length and mean wing length for
the female birds? Explain.

(b) Construct the simultaneous 95% $T^2$-intervals for $\mu_1$ and $\mu_2$ and the 95% Bonferroni
intervals for $\mu_1$ and $\mu_2$. Compare the two sets of intervals. What advantage, if any, do
the $T^2$-intervals have over the Bonferroni intervals?

(c) Is the bivariate normal distribution a viable population model? Explain with
reference to Q-Q plots and a scatter diagram.

5.21. Using the data on bone mineral content in Table 1.8, construct the 95% Bonferroni
intervals for the individual means. Also, find the 95% simultaneous $T^2$-intervals.
Compare the two sets of intervals.

5.22. A portion of the data contained in Table 6.10 in Chapter 6 is reproduced in Table 5.13.
These data represent various costs associated with transporting milk from farms to dairy
plants for gasoline trucks. Only the first 25 multivariate observations for gasoline trucks
are given. Observations 9 and 21 have been identified as outliers from the full data set of
36 observations. (See [2].)




Table 5.13 Milk Transportation-Cost Data 


Fuel (x1) Repair (x2) Capital (x3)

16.44 12.43 11.23 
2.70 3.92 
1.35 9.75 
5.78 7.78 
5.05 10.67 
5.78 9.88 
10.98 10.60 
14.27 9.45
15.09 3.28 
7.61 10.23 
5.80 8.13 
3.63 9.13 
5.07 10.17 
6.15 7.61 
14.26 14.39 
2.59 6.09 
6.05 12.14 
2.70 12.23 
7.73 11.68 
14.02 12.01 
17.44 16.89 
8.24 7.18 
13.37 17.59 
10.78 14.58 


5.16 17.00 


(a) Construct Q-Q plots of the marginal distributions of fuel, repair, and capital costs. 
Also, construct the three possible scatter diagrams from the pairs of observations on 
different variables. Are the outliers evident? Repeat the Q-Q plots and the scatter 
diagrams with the apparent outliers removed. Do the data now appear to be nor- 
mally distributed? Discuss. 


(b) Construct 95% Bonferroni intervals for the individual cost means. Also, find the
95% $T^2$-intervals. Compare the two sets of intervals.


5.23. Consider the 30 observations on male Egyptian skulls for the first time period given in 
Table 6.13 on page 349. 


(a) Construct Q-Q plots of the marginal distributions of the maxbreath, basheight,
baslength and nasheight variables. Also, construct a chi-square plot of the
multivariate observations. Do these data appear to be normally distributed?
Explain.

(b) Construct 95% Bonferroni intervals for the individual skull dimension variables.
Also, find the 95% $T^2$-intervals. Compare the two sets of intervals.


5.24. Using the Madison, Wisconsin, Police Department data in Table 5.8, construct individual
$\bar X$ charts for $x_3$ = holdover hours and $x_4$ = COA hours. Do these individual process
characteristics seem to be in control? (That is, are they stable?) Comment.

5.25. Refer to Exercise 5.24. Using the data on the holdover and COA overtime hours,
construct a quality ellipse and a $T^2$-chart. Does the process represented by the bivariate
observations appear to be in control? (That is, is it stable?) Comment. Do you learn
something from the multivariate control charts that was not apparent in the individual
$\bar X$-charts?

5.26. Construct a $T^2$-chart using the data on $x_1$ = legal appearances overtime hours,
$x_2$ = extraordinary event overtime hours, and $x_3$ = holdover overtime hours from
Table 5.8. Compare this chart with the chart in Figure 5.8 of Example 5.10. Does plotting
$T^2$ with an additional characteristic change your conclusion about process stability?
Explain.

5.27. Using the data on $x_3$ = holdover hours and $x_4$ = COA hours from Table 5.8, construct
a prediction ellipse for a future observation $x' = (x_3, x_4)$. Remember, a prediction
ellipse should be calculated from a stable process. Interpret the result.


5.28. As part of a study of its sheet metal assembly process, a major automobile manufacturer
uses sensors that record the deviation from the nominal thickness (millimeters) at six
locations on a car. The first four are measured when the car body is complete and the last
two are measured on the underbody at an earlier stage of assembly. Data on 50 cars are
given in Table 5.14.

(a) The process seems stable for the first 30 cases. Use these cases to estimate $S$ and $\bar x$.
Then construct a $T^2$ chart using all of the variables. Include all 50 cases.

(b) Which individual locations seem to show a cause for concern?

5.29. Refer to the car body data in Exercise 5.28. These are all measured as deviations from
target value so it is appropriate to test the null hypothesis that the mean vector is zero.
Using the first 30 cases, test $H_0\colon \mu = 0$ at $\alpha = .05$.

5.30. Refer to the data on energy consumption in Exercise 3.18.

(a) Obtain the large sample 95% Bonferroni confidence intervals for the mean
consumption of each of the four types, the total of the four, and the difference, petroleum
minus natural gas.

(b) Obtain the large sample 95% simultaneous $T^2$ intervals for the mean consumption
of each of the four types, the total of the four, and the difference, petroleum minus
natural gas. Compare with your results for Part a.

5.31. Refer to the data on snow storms in Exercise 3.20.
(a) Find a 95% confidence region for the mean vector after taking an appropriate trans- 
formation. 
(b) On the same scale, find the 95% Bonferroni confidence intervals for the two compo- 
nent means. 




TABLE 5.14 Car Body Assembly Data

Index     x1      x2      x3      x4      x5      x6
  1     -0.12    0.36    0.40    0.25    1.37   -0.13
  2     -0.60   -0.35    0.04   -0.28   -0.25   -0.15
  3     -0.13    0.05    0.84    0.61    1.45    0.25
  4     -0.46   -0.37    0.30    0.00   -0.12   -0.25
  5     -0.46   -0.24    0.37    0.13    0.78   -0.15
  6     -0.46   -0.16    0.07    0.10    1.15   -0.18
  7     -0.46   -0.24    0.13    0.02    0.26   -0.20
  8     -0.13    0.05   -0.01    0.09   -0.15   -0.18
  9     -0.31   -0.16   -0.20    0.23    0.65    0.15
 10     -0.37   -0.24    0.37    0.21    1.15    0.05
 11     -1.08   -0.83   -0.81    0.05    0.21    0.00
 12     -0.42   -0.30    0.37   -0.58    0.00   -0.45
 13     -0.31    0.10   -0.24    0.24    0.65    0.35
 14     -0.14    0.06    0.18   -0.50    1.25    0.05
 15     -0.61   -0.35   -0.24    0.75    0.15   -0.20
 16     -0.61   -0.30    0.20   -0.21   -0.50   -0.25
 17     -0.84   -0.35   -0.14   -0.22    1.65   -0.05
 18     -0.96   -0.85    0.19   -0.18    1.00   -0.08
 19     -0.90   -0.34   -0.78   -0.15    0.25    0.25
 20     -0.46    0.36    0.24   -0.58    0.15    0.25
 21     -0.90   -0.59    0.13    0.13    0.60   -0.08
 22     -0.61   -0.50   -0.34   -0.58    0.95   -0.08
 23     -0.61   -0.20   -0.58   -0.20    1.10    0.00
 24     -0.46   -0.30   -0.10   -0.10    0.75   -0.10
 25     -0.60   -0.35   -0.45    0.37    1.18   -0.30
 26     -0.60   -0.36   -0.34   -0.11    1.68   -0.32
 27     -0.31    0.35   -0.45   -0.10    1.00   -0.25
 28     -0.60   -0.25   -0.42    0.28    0.75    0.10
 29     -0.31    0.25   -0.34   -0.24    0.65    0.10
 30     -0.36   -0.16    0.15   -0.38    1.18   -0.10
 31     -0.40   -0.12   -0.48   -0.34    0.30   -0.20
 32     -0.60   -0.40   -0.20    0.32    0.50    0.10
 33     -0.47   -0.16   -0.34   -0.31    0.85    0.60
 34     -0.46   -0.18    0.16    0.01    0.60    0.35
 35     -0.44   -0.12   -0.20   -0.48    1.40    0.10
 36     -0.90   -0.40    0.75   -0.31    0.60   -0.10
 37     -0.50   -0.35    0.84   -0.52    0.35   -0.75
 38     -0.38    0.08    0.55   -0.15    0.80   -0.10
 39     -0.60   -0.35   -0.35   -0.34    0.60    0.85
 40      0.11    0.24    0.15    0.40    0.00   -0.10
 41      0.05    0.12    0.85    0.55    1.65   -0.10
 42     -0.11    0.24    0.50    0.35    0.80   -0.21
 43                     -0.10   -0.58    1.85   -0.11
 44                      0.75   -0.10    0.65   -0.10
 45                      0.13    0.84    0.85    0.15
 46                      0.05    0.61    1.00    0.20
 47                      0.37   -0.15    0.68    0.25
 48                     -0.10    0.75    0.45    0.20
 49                      0.37   -0.25    1.05    0.15
 50                     -0.05   -0.20    1.21    0.10


Source: Data Courtesy of Darek Ceglarek. 




References 


1. Agresti, A. Categorical Data Analysis (2nd ed.). New York: John Wiley, 2002.

2. Bacon-Shone, J., and W. K. Fung. "A New Graphical Method for Detecting Single and
Multiple Outliers in Univariate and Multivariate Data." Applied Statistics, 36, no. 2
(1987), 153-162.

3. Bickel, P. J., and K. A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics,
Vol. I (2nd ed.). Upper Saddle River, NJ: Prentice Hall, 2000.

4. Bishop, Y. M. M., S. E. Feinberg, and P. W. Holland. Discrete Multivariate Analysis: Theory
and Practice (Paperback). Cambridge, MA: The MIT Press, 1977.

5. Dempster, A. P., N. M. Laird, and D. B. Rubin. "Maximum Likelihood from Incomplete
Data via the EM Algorithm (with Discussion)." Journal of the Royal Statistical Society
(B), 39, no. 1 (1977), 1-38.

6. Hartley, H. O. "Maximum Likelihood Estimation from Incomplete Data." Biometrics, 14
(1958), 174-194.

7. Hartley, H. O., and R. R. Hocking. "The Analysis of Incomplete Data." Biometrics, 27
(1971), 783-808.

8. Johnson, R. A., and T. Langeland. "A Linear Combinations Test for Detecting Serial
Correlation in Multivariate Samples." Topics in Statistical Dependence. (1991) Institute of
Mathematical Statistics Monograph, Eds. Block, H. et al., 299-313.

9. Johnson, R. A., and R. Li. "Multivariate Statistical Process Control Schemes for
Controlling a Mean." Springer Handbook of Engineering Statistics (2006), H. Pham, Ed.
Springer, Berlin.

10. Ryan, T. P. Statistical Methods for Quality Improvement (2nd ed.). New York: John Wiley,
2000.

11. Tiku, M. L., and M. Singh. "Robust Statistics for Testing Mean Vectors of Multivariate
Distributions." Communications in Statistics - Theory and Methods, 11, no. 9 (1982),
985-1001.


COMPARISONS OF SEVERAL 
MULTIVARIATE MEANS 


6.1 Introduction


The ideas developed in Chapter 5 can be extended to handle problems involving the 
comparison of several mean vectors. The theory is a little more complicated and 
rests on an assumption of multivariate normal distributions or large sample sizes. 
Similarly, the notation becomes a bit cumbersome. To circumvent these problems,
we shall often review univariate procedures for comparing several means and then
generalize to the corresponding multivariate cases by analogy. The numerical
examples we present will help cement the concepts.

Because comparisons of means frequently (and should) emanate from designed
experiments, we take the opportunity to discuss some of the tenets of good
experimental practice. A repeated measures design, useful in behavioral studies, is explicitly
considered, along with modifications required to analyze growth curves.

We begin by considering pairs of mean vectors. In later sections, we discuss sev- 
eral comparisons among mean vectors arranged according to treatment levels. The 
corresponding test statistics depend upon a partitioning of the total variation into 
pieces of variation attributable to the treatment sources and error. This partitioning 
is known as the multivariate analysis of variance (MANOVA). 


6.2 Paired Comparisons and a Repeated Measures Design 


Paired Comparisons 


Measurements are often recorded under different sets of experimental conditions
to see whether the responses differ significantly over these sets. For example, the
efficacy of a new drug or of a saturation advertising campaign may be determined by
comparing measurements before the "treatment" (drug or advertising) with those
after the treatment. In other situations, two or more treatments can be administered
to the same or similar experimental units, and responses can be compared to assess
the effects of the treatments.

One rational approach to comparing two treatments, or the presence and
absence of a single treatment, is to assign both treatments to the same or identical units
(individuals, stores, plots of land, and so forth). The paired responses may then be
analyzed by computing their differences, thereby eliminating much of the influence
of extraneous unit-to-unit variation.

In the single response (univariate) case, let $X_{j1}$ denote the response to
treatment 1 (or the response before treatment), and let $X_{j2}$ denote the response to
treatment 2 (or the response after treatment) for the $j$th trial. That is, $(X_{j1}, X_{j2})$
are measurements recorded on the $j$th unit or $j$th pair of like units. By design, the
$n$ differences

\[ D_j = X_{j1} - X_{j2}, \qquad j = 1, 2, \ldots, n \tag{6-1} \]

should reflect only the differential effects of the treatments.

Given that the differences $D_j$ in (6-1) represent independent observations from
an $N(\delta, \sigma_d^2)$ distribution, the variable

\[ t = \frac{\bar D - \delta}{s_d/\sqrt{n}} \tag{6-2} \]

where

\[ \bar D = \frac{1}{n}\sum_{j=1}^{n} D_j \quad\text{and}\quad s_d^2 = \frac{1}{n-1}\sum_{j=1}^{n} (D_j - \bar D)^2 \tag{6-3} \]

has a $t$-distribution with $n - 1$ d.f. Consequently, an $\alpha$-level test of

\[ H_0\colon \delta = 0 \ \text{(zero mean difference for treatments)} \quad\text{versus}\quad H_1\colon \delta \neq 0 \]

may be conducted by comparing $|t|$ with $t_{n-1}(\alpha/2)$, the upper $100(\alpha/2)$th
percentile of a $t$-distribution with $n - 1$ d.f. A $100(1 - \alpha)\%$ confidence interval for the
mean difference $\delta = E(X_{j1} - X_{j2})$ is provided by the statement

\[ \bar d - t_{n-1}(\alpha/2)\frac{s_d}{\sqrt{n}} \le \delta \le \bar d + t_{n-1}(\alpha/2)\frac{s_d}{\sqrt{n}} \tag{6-4} \]

(For example, see [11].)
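As a minimal sketch of (6-2) through (6-4), the paired $t$ computation can be written in a few lines. The before/after measurements below are hypothetical, and the critical value $t_7(.025) = 2.365$ is a standard table value entered by hand rather than computed.

```python
import math

def paired_t(x1, x2, delta0=0.0):
    """Return (d_bar, s_d, t) for paired samples, following (6-1) to (6-3)."""
    d = [a - b for a, b in zip(x1, x2)]          # differences D_j = X_j1 - X_j2
    n = len(d)
    d_bar = sum(d) / n
    s_d = math.sqrt(sum((dj - d_bar) ** 2 for dj in d) / (n - 1))
    t = (d_bar - delta0) / (s_d / math.sqrt(n))
    return d_bar, s_d, t

# Hypothetical paired responses on n = 8 units (made up for illustration).
treatment_1 = [12.1, 11.4, 13.0, 10.8, 12.6, 11.9, 12.3, 11.1]
treatment_2 = [11.2, 11.0, 12.1, 10.9, 11.8, 11.5, 11.6, 10.7]

T_CRIT = 2.365  # t_7(.025) from a standard t table (assumed value)

n = len(treatment_1)
d_bar, s_d, t = paired_t(treatment_1, treatment_2)
half_width = T_CRIT * s_d / math.sqrt(n)          # half-width of the interval (6-4)
lower, upper = d_bar - half_width, d_bar + half_width
reject = abs(t) > T_CRIT                           # alpha = .05 test of H0: delta = 0

print(f"d_bar = {d_bar:.3f}, t = {t:.3f}, reject H0: {reject}")
print(f"95% interval for delta: ({lower:.3f}, {upper:.3f})")
```

The test and the interval are two views of the same calculation: $H_0\colon \delta = 0$ is rejected at level $\alpha$ exactly when zero falls outside the $100(1 - \alpha)\%$ interval.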

Additional notation is required for the multivariate extension of the
paired-comparison procedure. It is necessary to distinguish between $p$ responses, two
treatments, and $n$ experimental units. We label the $p$ responses within the $j$th unit as

$X_{1j1}$ = variable 1 under treatment 1
$X_{1j2}$ = variable 2 under treatment 1
  ...
$X_{1jp}$ = variable $p$ under treatment 1

$X_{2j1}$ = variable 1 under treatment 2
$X_{2j2}$ = variable 2 under treatment 2
  ...
$X_{2jp}$ = variable $p$ under treatment 2

and the $p$ paired-difference random variables become

\[ \begin{aligned} D_{j1} &= X_{1j1} - X_{2j1} \\ D_{j2} &= X_{1j2} - X_{2j2} \\ &\ \,\vdots \\ D_{jp} &= X_{1jp} - X_{2jp} \end{aligned} \tag{6-5} \]

Let $D_j' = [D_{j1}, D_{j2}, \ldots, D_{jp}]$, and assume, for $j = 1, 2, \ldots, n$, that

\[ E(D_j) = \delta = \begin{bmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_p \end{bmatrix} \quad\text{and}\quad \mathrm{Cov}(D_j) = \Sigma_d \tag{6-6} \]

If, in addition, $D_1, D_2, \ldots, D_n$ are independent $N_p(\delta, \Sigma_d)$ random vectors,
inferences about the vector of mean differences $\delta$ can be based upon a $T^2$-statistic.
Specifically,

\[ T^2 = n(\bar D - \delta)' S_d^{-1} (\bar D - \delta) \tag{6-7} \]

where

\[ \bar D = \frac{1}{n}\sum_{j=1}^{n} D_j \quad\text{and}\quad S_d = \frac{1}{n-1}\sum_{j=1}^{n} (D_j - \bar D)(D_j - \bar D)' \tag{6-8} \]

Result 6.1. Let the differences $D_1, D_2, \ldots, D_n$ be a random sample from an
$N_p(\delta, \Sigma_d)$ population. Then

\[ T^2 = n(\bar D - \delta)' S_d^{-1} (\bar D - \delta) \]

is distributed as an $[(n - 1)p/(n - p)]F_{p,n-p}$ random variable, whatever the true $\delta$
and $\Sigma_d$.

If $n$ and $n - p$ are both large, $T^2$ is approximately distributed as a $\chi^2_p$ random
variable, regardless of the form of the underlying population of differences.

Proof. The exact distribution of $T^2$ is a restatement of the summary in (5-6), with
vectors of differences for the observation vectors. The approximate distribution of
$T^2$, for $n$ and $n - p$ large, follows from (4-28). ■


The condition $\delta = 0$ is equivalent to "no average difference between the two
treatments." For the $i$th variable, $\delta_i > 0$ implies that treatment 1 is larger, on
average, than treatment 2. In general, inferences about $\delta$ can be made using Result 6.1.

Given the observed differences $d_j' = [d_{j1}, d_{j2}, \ldots, d_{jp}]$, $j = 1, 2, \ldots, n$,
corresponding to the random variables in (6-5), an $\alpha$-level test of $H_0\colon \delta = 0$ versus
$H_1\colon \delta \neq 0$ for an $N_p(\delta, \Sigma_d)$ population rejects $H_0$ if the observed

\[ T^2 = n\bar d' S_d^{-1} \bar d > \frac{(n - 1)p}{(n - p)} F_{p,n-p}(\alpha) \]

where $F_{p,n-p}(\alpha)$ is the upper $(100\alpha)$th percentile of an $F$-distribution with $p$
and $n - p$ d.f. Here $\bar d$ and $S_d$ are given by (6-8).

A $100(1 - \alpha)\%$ confidence region for $\delta$ consists of all $\delta$ such that

\[ (\bar d - \delta)' S_d^{-1} (\bar d - \delta) \le \frac{(n - 1)p}{n(n - p)} F_{p,n-p}(\alpha) \tag{6-9} \]

Also, $100(1 - \alpha)\%$ simultaneous confidence intervals for the individual mean
differences $\delta_i$ are given by

\[ \delta_i\colon\ \bar d_i \pm \sqrt{\frac{(n - 1)p}{(n - p)} F_{p,n-p}(\alpha)}\,\sqrt{\frac{s_{d_i}^2}{n}} \tag{6-10} \]

where $\bar d_i$ is the $i$th element of $\bar d$ and $s_{d_i}^2$ is the $i$th diagonal element of $S_d$.

For $n - p$ large, $[(n - 1)p/(n - p)]F_{p,n-p}(\alpha) \doteq \chi^2_p(\alpha)$ and normality
need not be assumed.

The Bonferroni $100(1 - \alpha)\%$ simultaneous confidence intervals for the
individual mean differences are

\[ \delta_i\colon\ \bar d_i \pm t_{n-1}\!\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{d_i}^2}{n}} \tag{6-10a} \]

where $t_{n-1}(\alpha/2p)$ is the upper $100(\alpha/2p)$th percentile of a $t$-distribution with
$n - 1$ d.f.


Example 6.1 (Checking for a mean difference with paired observations) Municipal
wastewater treatment plants are required by law to monitor their discharges into
rivers and streams on a regular basis. Concern about the reliability of data from one
of these self-monitoring programs led to a study in which samples of effluent were
divided and sent to two laboratories for testing. One-half of each sample was sent to
the Wisconsin State Laboratory of Hygiene, and one-half was sent to a private
commercial laboratory routinely used in the monitoring program. Measurements of
biochemical oxygen demand (BOD) and suspended solids (SS) were obtained, for
n = 11 sample splits, from the two laboratories. The data are displayed in Table 6.1.

Table 6.1 Effluent Data
                 Commercial lab                State lab of hygiene
Sample j    x_1j1 (BOD)    x_1j2 (SS)     x_2j1 (BOD)    x_2j2 (SS)
   1             6             27              25             15
   2             6             23              28             13
   3            18             64              36             22
   4             8             44              35             29
   5            11             30              15             31
   6            34             75              44             64
   7            28             26              42             30
   8            71            124              54             64
   9            43             54              34             56
  10            33             30              29             20
  11            20             14              39             21
Source: Data courtesy of S. Weber.


Do the two laboratories' chemical analyses agree? If differences exist, what is
their nature?

The $T^2$-statistic for testing $H_0\colon \delta' = [\delta_1, \delta_2] = [0, 0]$ is constructed from the
differences of paired observations:

$d_{j1} = x_{1j1} - x_{2j1}$:  -19  -22  -18  -27   -4  -10  -14   17    9    4  -19
$d_{j2} = x_{1j2} - x_{2j2}$:   12   10   42   15   -1   11   -4   60   -2   10   -7

Here

\[ \bar d = \begin{bmatrix} \bar d_1 \\ \bar d_2 \end{bmatrix} = \begin{bmatrix} -9.36 \\ 13.27 \end{bmatrix}, \qquad S_d = \begin{bmatrix} 199.26 & 88.38 \\ 88.38 & 418.61 \end{bmatrix} \]

and

\[ T^2 = 11\,[-9.36,\ 13.27] \begin{bmatrix} .0055 & -.0012 \\ -.0012 & .0026 \end{bmatrix} \begin{bmatrix} -9.36 \\ 13.27 \end{bmatrix} = 13.6 \]

Taking $\alpha = .05$, we find that $[p(n - 1)/(n - p)]F_{p,n-p}(.05) = [2(10)/9]F_{2,9}(.05)$
$= 9.47$. Since $T^2 = 13.6 > 9.47$, we reject $H_0$ and conclude that there is a nonzero
mean difference between the measurements of the two laboratories. It appears,
from inspection of the data, that the commercial lab tends to produce lower BOD
measurements and higher SS measurements than the State Lab of Hygiene. The
95% simultaneous confidence intervals for the mean differences $\delta_1$ and $\delta_2$ can be
computed using (6-10). These intervals are

\[ \delta_1\colon\ \bar d_1 \pm \sqrt{\frac{(n - 1)p}{(n - p)} F_{p,n-p}(\alpha)}\,\sqrt{\frac{s_{d_1}^2}{n}} = -9.36 \pm \sqrt{9.47}\,\sqrt{\frac{199.26}{11}} \quad\text{or}\quad (-22.46,\ 3.74) \]

\[ \delta_2\colon\ 13.27 \pm \sqrt{9.47}\,\sqrt{\frac{418.61}{11}} \quad\text{or}\quad (-5.71,\ 32.25) \]

The 95% simultaneous confidence intervals include zero, yet the hypothesis $H_0\colon \delta = 0$
was rejected at the 5% level. What are we to conclude?

The evidence points toward real differences. The point $\delta = 0$ falls outside
the 95% confidence region for $\delta$ (see Exercise 6.1), and this result is consistent
with the $T^2$-test. The 95% simultaneous confidence coefficient applies to the
entire set of intervals that could be constructed for all possible linear
combinations of the form $a_1\delta_1 + a_2\delta_2$. The particular intervals corresponding to the
choices $(a_1 = 1, a_2 = 0)$ and $(a_1 = 0, a_2 = 1)$ contain zero. Other choices of $a_1$
and $a_2$ will produce simultaneous intervals that do not contain zero. (If the
hypothesis $H_0\colon \delta = 0$ were not rejected, then all simultaneous intervals would
include zero.)

The Bonferroni simultaneous intervals also cover zero. (See Exercise 6.2.)
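The arithmetic of Example 6.1 can be retraced from the listed differences alone. The sketch below is not the authors' code; it uses plain Python with the closed-form inverse of a 2 x 2 matrix. The value $F_{2,9}(.05) = 4.26$ is the one implied by the example, while $t_{10}(.05/4) = 2.634$ for the Bonferroni intervals is an assumed table value. Small last-digit differences from the printed figures are rounding effects.

```python
import math

# Paired differences listed in Example 6.1 (commercial lab minus state lab).
d1 = [-19, -22, -18, -27, -4, -10, -14, 17, 9, 4, -19]   # BOD
d2 = [12, 10, 42, 15, -1, 11, -4, 60, -2, 10, -7]        # SS
n, p = len(d1), 2

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance with divisor n - 1."""
    ub, vb = mean(u), mean(v)
    return sum((a - ub) * (b - vb) for a, b in zip(u, v)) / (len(u) - 1)

d_bar = (mean(d1), mean(d2))
s11, s12, s22 = cov(d1, d1), cov(d1, d2), cov(d2, d2)

# d_bar' S_d^{-1} d_bar using the explicit inverse of a 2 x 2 matrix.
det = s11 * s22 - s12 ** 2
quad = (s22 * d_bar[0] ** 2 - 2 * s12 * d_bar[0] * d_bar[1] + s11 * d_bar[1] ** 2) / det
T2 = n * quad

F_CRIT = 4.26                                  # F_{2,9}(.05), as used in the example
c2 = (n - 1) * p / (n - p) * F_CRIT            # critical constant, about 9.47
T_BONF = 2.634                                 # t_10(.05 / (2p)), assumed table value

def interval(center, variance, multiplier):
    half = multiplier * math.sqrt(variance / n)
    return center - half, center + half

sim_1 = interval(d_bar[0], s11, math.sqrt(c2))   # (6-10) for delta_1
sim_2 = interval(d_bar[1], s22, math.sqrt(c2))   # (6-10) for delta_2
bon_1 = interval(d_bar[0], s11, T_BONF)          # (6-10a) for delta_1
bon_2 = interval(d_bar[1], s22, T_BONF)          # (6-10a) for delta_2

print(f"T2 = {T2:.2f}, critical value = {c2:.2f}, reject H0: {T2 > c2}")
print("simultaneous:", sim_1, sim_2)
print("Bonferroni:  ", bon_1, bon_2)
```

The run reproduces the situation discussed above: $T^2$ exceeds the critical constant, yet each of the four component intervals contains zero.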


Our analysis assumed a normal distribution for the $D_j$. In fact, the situation is
further complicated by the presence of one or, possibly, two outliers. (See Exercise
6.3.) These data can be transformed to data more nearly normal, but with such a
small sample, it is difficult to remove the effects of the outlier(s). (See Exercise 6.4.)

The numerical results of this example illustrate an unusual circumstance that
can occur when making inferences. ■

The experimenter in Example 6.1 actually divided a sample by first shaking it and
then pouring it rapidly back and forth into two bottles for chemical analysis. This was
prudent because a simple division of the sample into two pieces obtained by pouring
the top half into one bottle and the remainder into another bottle might result in more
suspended solids in the lower half due to settling. The two laboratories would then not
be working with the same, or even like, experimental units, and the conclusions would
not pertain to laboratory competence, measuring techniques, and so forth.

Whenever an investigator can control the assignment of treatments to
experimental units, an appropriate pairing of units and a randomized assignment of
treatments can enhance the statistical analysis. Differences, if any, between supposedly
identical units must be identified and most-alike units paired. Further, a random
assignment of treatment 1 to one unit and treatment 2 to the other unit will help
eliminate the systematic effects of uncontrolled sources of variation. Randomization can
be implemented by flipping a coin to determine whether the first unit in a pair
receives treatment 1 (heads) or treatment 2 (tails). The remaining treatment is then
assigned to the other unit. A separate independent randomization is conducted for
each pair. One can conceive of the process as follows:


[Diagram: Experimental Design for Paired Comparisons. Like pairs of experimental units 1, 2, 3, ..., n; within each pair, treatments 1 and 2 are assigned at random.]

We conclude our discussion of paired comparisons by noting that $\bar d$ and $S_d$, and
hence $T^2$, may be calculated from the full-sample quantities $\bar x$ and $S$. Here $\bar x$ is the
$2p \times 1$ vector of sample averages for the $p$ variables on the two treatments given by

\[ \bar x' = [\bar x_{11}, \bar x_{12}, \ldots, \bar x_{1p}, \bar x_{21}, \bar x_{22}, \ldots, \bar x_{2p}] \tag{6-11} \]

and $S$ is the $2p \times 2p$ matrix of sample variances and covariances arranged as

\[ S = \begin{bmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{bmatrix}, \qquad \text{each block } p \times p \tag{6-12} \]

The matrix $S_{11}$ contains the sample variances and covariances for the $p$ variables on
treatment 1. Similarly, $S_{22}$ contains the sample variances and covariances computed
for the $p$ variables on treatment 2. Finally, $S_{12} = S_{21}'$ are the matrices of sample
covariances computed from observations on pairs of treatment 1 and treatment 2
variables.

Defining the matrix

\[ \underset{(p \times 2p)}{C} = \begin{bmatrix} 1 & 0 & \cdots & 0 & -1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & -1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & -1 \end{bmatrix} \tag{6-13} \]

(the first $-1$ in each row pattern begins in the $(p + 1)$st column), we can verify (see Exercise 6.9) that

\[ d_j = Cx_j, \quad j = 1, 2, \ldots, n; \qquad \bar d = C\bar x \quad\text{and}\quad S_d = CSC' \tag{6-14} \]

Thus,

\[ T^2 = n\bar x' C'(CSC')^{-1} C\bar x \tag{6-15} \]

and it is not necessary first to calculate the differences $d_1, d_2, \ldots, d_n$. On the other
hand, it is wise to calculate these differences in order to check normality and the
assumption of a random sample.

Each row $c_i'$ of the matrix $C$ in (6-13) is a contrast vector, because its elements
sum to zero. Attention is usually centered on contrasts when comparing treatments.
Each contrast is perpendicular to the vector $1' = [1, 1, \ldots, 1]$ since $c_i'1 = 0$. The
component $1'x_j$, representing the overall treatment sum, is ignored by the test
statistic $T^2$ presented in this section.
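The identities in (6-14) are easy to check numerically. The sketch below uses a small hypothetical data set with $p = 2$ responses per treatment (the numbers are made up) and plain-Python helpers written for this illustration. It compares $\bar d$ and $S_d$ computed directly from the differences with $C\bar x$ and $CSC'$ computed from the full-sample quantities.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mean_vector(X):
    n = len(X)
    return [sum(col) / n for col in zip(*X)]

def cov_matrix(X):
    """Sample covariance matrix with divisor n - 1."""
    n, m = len(X), mean_vector(X)
    k = len(m)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in X) / (n - 1)
             for j in range(k)] for i in range(k)]

# Hypothetical observations x_j' = [x_1j1, x_1j2, x_2j1, x_2j2] for n = 5 units.
X = [
    [10.0, 31.0, 12.0, 27.0],
    [14.0, 28.0, 13.0, 30.0],
    [ 9.0, 35.0, 11.0, 29.0],
    [16.0, 40.0, 12.0, 33.0],
    [12.0, 26.0, 15.0, 25.0],
]
p = 2

# C = [I_p | -I_p], the p x 2p matrix of (6-13).
C = [[1 if k == i else -1 if k == i + p else 0 for k in range(2 * p)] for i in range(p)]

# Route 1: form the differences d_j = C x_j, then summarize them.
D = [[sum(c * x for c, x in zip(c_row, x_j)) for c_row in C] for x_j in X]
d_bar_direct = mean_vector(D)
S_d_direct = cov_matrix(D)

# Route 2: summarize the full sample first, then apply C as in (6-14).
x_bar, S = mean_vector(X), cov_matrix(X)
d_bar_via_C = [sum(c * x for c, x in zip(c_row, x_bar)) for c_row in C]
S_d_via_C = matmul(matmul(C, S), transpose(C))

print(d_bar_direct, d_bar_via_C)
print(S_d_direct, S_d_via_C)
```

Both routes give the same $\bar d$ and $S_d$ up to floating-point rounding, which is why $T^2$ in (6-15) can be formed from $\bar x$ and $S$ alone.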


A Repeated Measures Design for Comparing Treatments

Another generalization of the univariate paired $t$-statistic arises in situations where
$q$ treatments are compared with respect to a single response variable. Each subject
or experimental unit receives each treatment once over successive periods of time.
The $j$th observation is

\[ X_j = \begin{bmatrix} X_{j1} \\ X_{j2} \\ \vdots \\ X_{jq} \end{bmatrix}, \qquad j = 1, 2, \ldots, n \]

where $X_{ji}$ is the response to the $i$th treatment on the $j$th unit. The name repeated
measures stems from the fact that all treatments are administered to each unit.

For comparative purposes, we consider contrasts of the components of
$\mu = E(X_j)$. These could be

\[ \begin{bmatrix} \mu_1 - \mu_2 \\ \mu_1 - \mu_3 \\ \vdots \\ \mu_1 - \mu_q \end{bmatrix} = \begin{bmatrix} 1 & -1 & 0 & \cdots & 0 \\ 1 & 0 & -1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & -1 \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_q \end{bmatrix} = C_1\mu \]

or

\[ \begin{bmatrix} \mu_2 - \mu_1 \\ \mu_3 - \mu_2 \\ \vdots \\ \mu_q - \mu_{q-1} \end{bmatrix} = \begin{bmatrix} -1 & 1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & -1 & 1 \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_q \end{bmatrix} = C_2\mu \]

Both $C_1$ and $C_2$ are called contrast matrices, because their $q - 1$ rows are linearly
independent and each is a contrast vector. The nature of the design eliminates much
of the influence of unit-to-unit variation on treatment comparisons. Of course, the
experimenter should randomize the order in which the treatments are presented to
each subject.

When the treatment means are equal, $C_1\mu = C_2\mu = 0$. In general, the
hypothesis that there are no differences in treatments (equal treatment means) becomes
$C\mu = 0$ for any choice of the contrast matrix $C$.

Consequently, based on the contrasts $Cx_j$ in the observations, we have means
$C\bar x$ and covariance matrix $CSC'$, and we test $C\mu = 0$ using the $T^2$-statistic

\[ T^2 = n(C\bar x)'(CSC')^{-1} C\bar x \]


Test for Equality of Treatments in a Repeated Measures Design

Consider an $N_q(\mu, \Sigma)$ population, and let $C$ be a contrast matrix. An $\alpha$-level test
of $H_0\colon C\mu = 0$ (equal treatment means) versus $H_1\colon C\mu \neq 0$ is as follows:
Reject $H_0$ if

\[ T^2 = n(C\bar x)'(CSC')^{-1} C\bar x > \frac{(n - 1)(q - 1)}{(n - q + 1)} F_{q-1,n-q+1}(\alpha) \tag{6-16} \]

where $F_{q-1,n-q+1}(\alpha)$ is the upper $(100\alpha)$th percentile of an $F$-distribution with
$q - 1$ and $n - q + 1$ d.f. Here $\bar x$ and $S$ are the sample mean vector and
covariance matrix defined, respectively, by

\[ \bar x = \frac{1}{n}\sum_{j=1}^{n} x_j \quad\text{and}\quad S = \frac{1}{n-1}\sum_{j=1}^{n} (x_j - \bar x)(x_j - \bar x)' \]

It can be shown that $T^2$ does not depend on the particular choice of $C$.$^1$

$^1$Any pair of contrast matrices $C_1$ and $C_2$ must be related by $C_1 = BC_2$, with $B$ nonsingular.
This follows because each $C$ has the largest possible number, $q - 1$, of linearly independent rows,
all perpendicular to the vector $1$. Then $(BC_2)'(BC_2SC_2'B')^{-1}(BC_2) = C_2'B'(B')^{-1}(C_2SC_2')^{-1}B^{-1}BC_2 =
C_2'(C_2SC_2')^{-1}C_2$, so $T^2$ computed with $C_2$ or $C_1 = BC_2$ gives the same result.
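The invariance claimed in the footnote can be checked numerically. As a sketch, the code below takes the summary statistics $\bar x$ and $S$ reported for the sleeping-dog data of Example 6.2 later in this section ($n = 19$, $q = 4$) and evaluates $T^2$ with two different contrast matrices. The helper functions, including the small Gaussian-elimination solver, are written only for this illustration.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for k in range(col, m + 1):
                M[r][k] -= factor * M[col][k]
    y = [0.0] * m
    for r in range(m - 1, -1, -1):
        y[r] = (M[r][m] - sum(M[r][k] * y[k] for k in range(r + 1, m))) / M[r][r]
    return y

def t2_statistic(C, x_bar, S, n):
    """T^2 = n (C x_bar)' (C S C')^{-1} (C x_bar), as in (6-16)."""
    Cx = matvec(C, x_bar)
    CSCt = matmul(matmul(C, S), transpose(C))
    return n * sum(a * b for a, b in zip(Cx, solve(CSCt, Cx)))

# Summary statistics as reported in Example 6.2.
n = 19
x_bar = [368.21, 404.63, 479.26, 502.89]
S = [
    [2819.29, 3568.42, 2943.49, 2295.35],
    [3568.42, 7963.14, 5303.98, 4065.44],
    [2943.49, 5303.98, 6851.32, 4499.63],
    [2295.35, 4065.44, 4499.63, 4878.99],
]

# Contrast matrix used in Example 6.2 (halothane, CO2 pressure, interaction).
C_example = [[-1, -1, 1, 1], [1, -1, 1, -1], [1, -1, -1, 1]]
# A different contrast matrix: each treatment compared with treatment 1.
C_first = [[1, -1, 0, 0], [1, 0, -1, 0], [1, 0, 0, -1]]

t2_a = t2_statistic(C_example, x_bar, S, n)
t2_b = t2_statistic(C_first, x_bar, S, n)
print(f"T2 with example contrasts: {t2_a:.2f}")
print(f"T2 with first-vs-rest contrasts: {t2_b:.2f}")
```

The two values agree up to floating-point rounding, and both are close to the value of 116 reported in the example.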


A confidence region for contrasts $C\mu$, with $\mu$ the mean of a normal population,
is determined by the set of all $C\mu$ such that

\[ n(C\bar x - C\mu)'(CSC')^{-1}(C\bar x - C\mu) \le \frac{(n - 1)(q - 1)}{(n - q + 1)} F_{q-1,n-q+1}(\alpha) \tag{6-17} \]

where $\bar x$ and $S$ are as defined in (6-16). Consequently, simultaneous $100(1 - \alpha)\%$
confidence intervals for single contrasts $c'\mu$ for any contrast vectors of interest are
given by (see Result 5A.1)

\[ c'\mu\colon\ c'\bar x \pm \sqrt{\frac{(n - 1)(q - 1)}{(n - q + 1)} F_{q-1,n-q+1}(\alpha)}\,\sqrt{\frac{c'Sc}{n}} \tag{6-18} \]


Example 6.2 (Testing for equal treatments in a repeated measures design) Improved
anesthetics are often developed by first studying their effects on animals. In one
study, 19 dogs were initially given the drug pentobarbitol. Each dog was then
administered carbon dioxide CO2 at each of two pressure levels. Next, halothane (H)
was added, and the administration of CO2 was repeated. The response, milliseconds
between heartbeats, was measured for the four treatment combinations:

                        CO2 pressure
                        Low      High
Halothane   Present     (4)      (3)
            Absent      (2)      (1)

Table 6.2 contains the four measurements for each of the 19 dogs, where
Treatment 1 = high CO2 pressure without H
Treatment 2 = low CO2 pressure without H
Treatment 3 = high CO2 pressure with H
Treatment 4 = low CO2 pressure with H
We shall analyze the anesthetizing effects of CO2 pressure and halothane from
this repeated-measures design.
There are three treatment contrasts that might be of interest in the experiment.
Let $\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$ correspond to the mean responses for treatments 1, 2, 3, and
4, respectively. Then

$(\mu_3 + \mu_4) - (\mu_1 + \mu_2)$ = halothane contrast representing the difference between the presence and
absence of halothane

$(\mu_1 + \mu_3) - (\mu_2 + \mu_4)$ = CO2 contrast representing the difference between high and low CO2 pressure

$(\mu_1 + \mu_4) - (\mu_2 + \mu_3)$ = contrast representing the influence of halothane on CO2 pressure differences
(H-CO2 pressure "interaction")

Table 6.2 Sleeping-Dog Data

              Treatment
Dog      1      2      3      4
  1     426    609    556    600
  2     253    236    392    395
  3     359    433    349    357
  4     432    431    522    600
  5     405    426    513    513
  6     324    438    507    539
  7     310    312    410    456
  8     326    326    350    504
  9     375    447    547    548
 10     286    286    403    422
 11     349    382    473    497
 12     429    410    488    547
 13     348    377    447    514
 14     412    473    472    446
 15     347    326    455    468
 16     434    458    637    524
 17     364    367    432    469
 18     420    395    508    531
 19     397    556    645    625

Source: Data courtesy of Dr. J. Atlee.


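With $\boldsymbol{\mu}' = [\mu_1, \mu_2, \mu_3, \mu_4]$, the contrast matrix $\mathbf{C}$ is

$$\mathbf{C} = \begin{bmatrix} -1 & -1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}$$

The data (see Table 6.2) give

$$\bar{\mathbf{x}} = \begin{bmatrix} 368.21 \\ 404.63 \\ 479.26 \\ 502.89 \end{bmatrix} \quad\text{and}\quad \mathbf{S} = \begin{bmatrix} 2819.29 & & & \\ 3568.42 & 7963.14 & & \\ 2943.49 & 5303.98 & 6851.32 & \\ 2295.35 & 4065.44 & 4499.63 & 4878.99 \end{bmatrix}$$

(only the lower triangle of the symmetric matrix $\mathbf{S}$ is shown). It can be verified that

$$\mathbf{C}\bar{\mathbf{x}} = \begin{bmatrix} 209.31 \\ -60.05 \\ -12.79 \end{bmatrix}; \qquad \mathbf{C}\mathbf{S}\mathbf{C}' = \begin{bmatrix} 9432.32 & 1098.92 & 927.62 \\ 1098.92 & 5195.84 & 914.54 \\ 927.62 & 914.54 & 7557.44 \end{bmatrix}$$

and

$$T^2 = n(\mathbf{C}\bar{\mathbf{x}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{x}}) = 19(6.11) = 116$$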




With $\alpha = .05$,

$$\frac{(n-1)(q-1)}{(n-q+1)} F_{q-1,\,n-q+1}(\alpha) = \frac{18(3)}{16} F_{3,16}(.05) = \frac{18(3)}{16}(3.24) = 10.94$$

From (6-16), $T^2 = 116 > 10.94$, and we reject $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ (no treatment effects).
To see which of the contrasts are responsible for the rejection of $H_0$, we construct
95% simultaneous confidence intervals for these contrasts. From (6-18), the
contrast

$$\mathbf{c}_1'\boldsymbol{\mu} = (\mu_3 + \mu_4) - (\mu_1 + \mu_2) = \text{halothane influence}$$

is estimated by the interval

$$(\bar{x}_3 + \bar{x}_4) - (\bar{x}_1 + \bar{x}_2) \pm \sqrt{\frac{18(3)}{16} F_{3,16}(.05)}\,\sqrt{\frac{\mathbf{c}_1'\mathbf{S}\mathbf{c}_1}{19}} = 209.31 \pm \sqrt{10.94}\,\sqrt{\frac{9432.32}{19}} = 209.31 \pm 73.70$$

where $\mathbf{c}_1'$ is the first row of $\mathbf{C}$. Similarly, the remaining contrasts are estimated by

CO₂ pressure influence $= (\mu_1 + \mu_3) - (\mu_2 + \mu_4)$:

$$-60.05 \pm \sqrt{10.94}\,\sqrt{\frac{5195.84}{19}} = -60.05 \pm 54.70$$

H-CO₂ pressure "interaction" $= (\mu_1 + \mu_4) - (\mu_2 + \mu_3)$:

$$-12.79 \pm \sqrt{10.94}\,\sqrt{\frac{7557.44}{19}} = -12.79 \pm 65.97$$


The first confidence interval implies that there is a halothane effect. The
presence of halothane produces longer times between heartbeats. This occurs at both
levels of CO₂ pressure, since the H-CO₂ pressure interaction contrast,
$(\mu_1 + \mu_4) - (\mu_2 + \mu_3)$, is not significantly different from zero. (See the third
confidence interval.) The second confidence interval indicates that there is an
effect due to CO₂ pressure: The lower CO₂ pressure produces longer times between
heartbeats.

Some caution must be exercised in our interpretation of the results because the
trials with halothane must follow those without. The apparent H-effect may be due
to a time trend. (Ideally, the time order of all treatments should be determined at
random.)
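The arithmetic of Example 6.2 can be checked with a short script. The following is a minimal sketch, not the authors' code: it applies the three contrasts to each dog's four responses in Table 6.2, so that the sample mean and covariance of the contrast scores are $\mathbf{C}\bar{\mathbf{x}}$ and $\mathbf{C}\mathbf{S}\mathbf{C}'$. The helper `solve` and all variable names are illustrative assumptions, and the critical value $F_{3,16}(.05) = 3.24$ is taken from the text rather than computed.

```python
# Sleeping-dog data (Table 6.2): one tuple per dog, treatments 1-4 in order.
DOGS = [
    (426, 609, 556, 600), (253, 236, 392, 395), (359, 433, 349, 357),
    (432, 431, 522, 600), (405, 426, 513, 513), (324, 438, 507, 539),
    (310, 312, 410, 456), (326, 326, 350, 504), (375, 447, 547, 548),
    (286, 286, 403, 422), (349, 382, 473, 497), (429, 410, 488, 547),
    (348, 377, 447, 514), (412, 473, 472, 446), (347, 326, 455, 468),
    (434, 458, 637, 524), (364, 367, 432, 469), (420, 395, 508, 531),
    (397, 556, 645, 625),
]
# Rows: halothane, CO2 pressure, and interaction contrasts.
C = [(-1, -1, 1, 1), (1, -1, 1, -1), (1, -1, -1, 1)]
F_CRIT = 3.24  # F_{3,16}(.05), the tabled value quoted in the text


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (hypothetical helper)."""
    size = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        piv = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, size):
            f = m[r][col] / m[col][col]
            for k in range(col, size + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * size
    for i in reversed(range(size)):
        s = sum(m[i][k] * x[k] for k in range(i + 1, size))
        x[i] = (m[i][size] - s) / m[i][i]
    return x


n, q = len(DOGS), len(DOGS[0])
k = len(C)
# Contrast scores C x_j: their mean is C xbar and their covariance is C S C'.
scores = [[sum(c * x for c, x in zip(row, dog)) for row in C] for dog in DOGS]
mean = [sum(s[i] for s in scores) / n for i in range(k)]
cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in scores) / (n - 1)
        for j in range(k)] for i in range(k)]

# T^2 = n (C xbar)' (C S C')^{-1} (C xbar), compared with the scaled F value.
t2 = n * sum(m_i * w_i for m_i, w_i in zip(mean, solve(cov, mean)))
crit = (n - 1) * (q - 1) / (n - q + 1) * F_CRIT
# Half-widths of the simultaneous intervals in (6-18).
half_widths = [(crit * cov[i][i] / n) ** 0.5 for i in range(k)]

print(f"T2 = {t2:.1f}, critical value = {crit:.2f}")
for est, hw in zip(mean, half_widths):
    print(f"{est:8.2f} +/- {hw:.2f}")
```

Working with contrast scores avoids forming the 4 x 4 matrix $\mathbf{S}$ explicitly; the result is algebraically the same as computing $\mathbf{C}\mathbf{S}\mathbf{C}'$ from $\mathbf{S}$.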


The test in (6-16) is appropriate when the covariance matrix, $\text{Cov}(\mathbf{X}) = \boldsymbol{\Sigma}$,
cannot be assumed to have any special structure. If it is reasonable to assume that $\boldsymbol{\Sigma}$
has a particular structure, tests designed with this structure in mind have higher
power than the one in (6-16). (For $\boldsymbol{\Sigma}$ with the equal correlation structure (8-14), see
a discussion of the "randomized block" design in [17] or [22].)




6.3 Comparing Mean Vectors from Two Populations 




A $T^2$-statistic for testing the equality of vector means from two multivariate
populations can be developed by analogy with the univariate procedure. (See [11] for a
discussion of the univariate case.) This $T^2$-statistic is appropriate for comparing
responses from one set of experimental settings (population 1) with independent
responses from another set of experimental settings (population 2). The comparison
can be made without explicitly controlling for unit-to-unit variability, as in the
paired-comparison case.
If possible, the experimental units should be randomly assigned to the sets of
experimental conditions. Randomization will, to some extent, mitigate the effect
of unit-to-unit variability in a subsequent comparison of treatments. Although some
precision is lost relative to paired comparisons, the inferences in the two-population
case are, ordinarily, applicable to a more general collection of experimental units
simply because unit homogeneity is not required.
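Consider a random sample of size $n_1$ from population 1 and a sample of
size $n_2$ from population 2. The observations on $p$ variables can be arranged as
follows:

Sample                                                              Summary statistics

(Population 1)  $\mathbf{x}_{11}, \mathbf{x}_{12}, \ldots, \mathbf{x}_{1n_1}$      $\bar{\mathbf{x}}_1 = \frac{1}{n_1}\sum_{j=1}^{n_1}\mathbf{x}_{1j}$,   $\mathbf{S}_1 = \frac{1}{n_1 - 1}\sum_{j=1}^{n_1}(\mathbf{x}_{1j} - \bar{\mathbf{x}}_1)(\mathbf{x}_{1j} - \bar{\mathbf{x}}_1)'$

(Population 2)  $\mathbf{x}_{21}, \mathbf{x}_{22}, \ldots, \mathbf{x}_{2n_2}$      $\bar{\mathbf{x}}_2 = \frac{1}{n_2}\sum_{j=1}^{n_2}\mathbf{x}_{2j}$,   $\mathbf{S}_2 = \frac{1}{n_2 - 1}\sum_{j=1}^{n_2}(\mathbf{x}_{2j} - \bar{\mathbf{x}}_2)(\mathbf{x}_{2j} - \bar{\mathbf{x}}_2)'$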

In this notation, the first subscript (1 or 2) denotes the population.
We want to make inferences about

(mean vector of population 1) − (mean vector of population 2) $= \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$.

For instance, we shall want to answer the question, Is $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ (or, equivalently, is
$\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \mathbf{0}$)? Also, if $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 \ne \mathbf{0}$, which component means are different?
With a few tentative assumptions, we are able to provide answers to these questions.


Assumptions Concerning the Structure of the Data 


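1. The sample $\mathbf{X}_{11}, \mathbf{X}_{12}, \ldots, \mathbf{X}_{1n_1}$ is a random sample of size $n_1$ from a $p$-variate
population with mean vector $\boldsymbol{\mu}_1$ and covariance matrix $\boldsymbol{\Sigma}_1$.

2. The sample $\mathbf{X}_{21}, \mathbf{X}_{22}, \ldots, \mathbf{X}_{2n_2}$ is a random sample of size $n_2$ from a $p$-variate
population with mean vector $\boldsymbol{\mu}_2$ and covariance matrix $\boldsymbol{\Sigma}_2$.

3. Also, $\mathbf{X}_{11}, \mathbf{X}_{12}, \ldots, \mathbf{X}_{1n_1}$ are independent of $\mathbf{X}_{21}, \mathbf{X}_{22}, \ldots, \mathbf{X}_{2n_2}$.    (6-19)

We shall see later that, for large samples, this structure is sufficient for making
inferences about the $p \times 1$ vector $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$. However, when the sample sizes $n_1$ and
$n_2$ are small, more assumptions are needed.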




Further Assumptions When $n_1$ and $n_2$ Are Small


1. Both populations are multivariate normal.
2. Also, $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ (same covariance matrix).    (6-20)


The second assumption, that $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$, is much stronger than its univariate
counterpart. Here we are assuming that several pairs of variances and covariances are
nearly equal.


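When $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$, $\sum_{j=1}^{n_1}(\mathbf{x}_{1j} - \bar{\mathbf{x}}_1)(\mathbf{x}_{1j} - \bar{\mathbf{x}}_1)'$ is an estimate of $(n_1 - 1)\boldsymbol{\Sigma}$ and
$\sum_{j=1}^{n_2}(\mathbf{x}_{2j} - \bar{\mathbf{x}}_2)(\mathbf{x}_{2j} - \bar{\mathbf{x}}_2)'$ is an estimate of $(n_2 - 1)\boldsymbol{\Sigma}$. Consequently, we can pool the
information in both samples in order to estimate the common covariance $\boldsymbol{\Sigma}$.

We set

$$\mathbf{S}_{\text{pooled}} = \frac{\sum_{j=1}^{n_1}(\mathbf{x}_{1j} - \bar{\mathbf{x}}_1)(\mathbf{x}_{1j} - \bar{\mathbf{x}}_1)' + \sum_{j=1}^{n_2}(\mathbf{x}_{2j} - \bar{\mathbf{x}}_2)(\mathbf{x}_{2j} - \bar{\mathbf{x}}_2)'}{n_1 + n_2 - 2} = \frac{n_1 - 1}{n_1 + n_2 - 2}\mathbf{S}_1 + \frac{n_2 - 1}{n_1 + n_2 - 2}\mathbf{S}_2 \tag{6-21}$$

Since $\sum_{j=1}^{n_1}(\mathbf{x}_{1j} - \bar{\mathbf{x}}_1)(\mathbf{x}_{1j} - \bar{\mathbf{x}}_1)'$ has $n_1 - 1$ d.f. and $\sum_{j=1}^{n_2}(\mathbf{x}_{2j} - \bar{\mathbf{x}}_2)(\mathbf{x}_{2j} - \bar{\mathbf{x}}_2)'$ has
$n_2 - 1$ d.f., the divisor $(n_1 - 1) + (n_2 - 1)$ in (6-21) is obtained by combining the
two component degrees of freedom. [See (4-24).] Additional support for the pooling
procedure comes from consideration of the multivariate normal likelihood. (See
Exercise 6.11.)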

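To test the hypothesis that $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \boldsymbol{\delta}_0$, a specified vector, we consider the
squared statistical distance from $\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2$ to $\boldsymbol{\delta}_0$. Now,

$$E(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2) = E(\bar{\mathbf{X}}_1) - E(\bar{\mathbf{X}}_2) = \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$$

Since the independence assumption in (6-19) implies that $\bar{\mathbf{X}}_1$ and $\bar{\mathbf{X}}_2$ are independent
and thus $\text{Cov}(\bar{\mathbf{X}}_1, \bar{\mathbf{X}}_2) = \mathbf{0}$ (see Result 4.5), by (3-9), it follows that

$$\text{Cov}(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2) = \text{Cov}(\bar{\mathbf{X}}_1) + \text{Cov}(\bar{\mathbf{X}}_2) = \frac{1}{n_1}\boldsymbol{\Sigma} + \frac{1}{n_2}\boldsymbol{\Sigma} = \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\boldsymbol{\Sigma} \tag{6-22}$$

Because $\mathbf{S}_{\text{pooled}}$ estimates $\boldsymbol{\Sigma}$, we see that

$$\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}_{\text{pooled}}$$

is an estimator of $\text{Cov}(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)$.
The likelihood ratio test of

$$H_0: \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \boldsymbol{\delta}_0$$

is based on the square of the statistical distance, $T^2$, and is given by (see [1]).

Reject $H_0$ if

$$T^2 = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2 - \boldsymbol{\delta}_0)'\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}_{\text{pooled}}\right]^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2 - \boldsymbol{\delta}_0) > c^2 \tag{6-23}$$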




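where the critical distance $c^2$ is determined from the distribution of the two-sample
$T^2$-statistic.

Result 6.2. If $\mathbf{X}_{11}, \mathbf{X}_{12}, \ldots, \mathbf{X}_{1n_1}$ is a random sample of size $n_1$ from $N_p(\boldsymbol{\mu}_1, \boldsymbol{\Sigma})$ and
$\mathbf{X}_{21}, \mathbf{X}_{22}, \ldots, \mathbf{X}_{2n_2}$ is an independent random sample of size $n_2$ from $N_p(\boldsymbol{\mu}_2, \boldsymbol{\Sigma})$, then

$$T^2 = [\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]'\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}_{\text{pooled}}\right]^{-1}[\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]$$

is distributed as

$$\frac{(n_1 + n_2 - 2)p}{(n_1 + n_2 - p - 1)} F_{p,\,n_1+n_2-p-1}$$

Consequently,

$$P\left[(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2))'\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}_{\text{pooled}}\right]^{-1}(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)) \le c^2\right] = 1 - \alpha \tag{6-24}$$

where

$$c^2 = \frac{(n_1 + n_2 - 2)p}{(n_1 + n_2 - p - 1)} F_{p,\,n_1+n_2-p-1}(\alpha)$$

Proof. We first note that

$$\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 = \frac{1}{n_1}\mathbf{X}_{11} + \frac{1}{n_1}\mathbf{X}_{12} + \cdots + \frac{1}{n_1}\mathbf{X}_{1n_1} - \frac{1}{n_2}\mathbf{X}_{21} - \frac{1}{n_2}\mathbf{X}_{22} - \cdots - \frac{1}{n_2}\mathbf{X}_{2n_2}$$

is distributed as

$$N_p\left(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2,\ \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\boldsymbol{\Sigma}\right)$$

by Result 4.8, with $c_1 = c_2 = \cdots = c_{n_1} = 1/n_1$ and $c_{n_1+1} = c_{n_1+2} = \cdots = c_{n_1+n_2} = -1/n_2$. According to (4-23),

$(n_1 - 1)\mathbf{S}_1$ is distributed as $W_{n_1-1}(\boldsymbol{\Sigma})$ and $(n_2 - 1)\mathbf{S}_2$ as $W_{n_2-1}(\boldsymbol{\Sigma})$

By assumption, the $\mathbf{X}_{1j}$'s and the $\mathbf{X}_{2j}$'s are independent, so $(n_1 - 1)\mathbf{S}_1$ and
$(n_2 - 1)\mathbf{S}_2$ are also independent. From (4-24), $(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2$ is then
distributed as $W_{n_1+n_2-2}(\boldsymbol{\Sigma})$. Therefore,

$$T^2 = \left(\frac{1}{n_1} + \frac{1}{n_2}\right)^{-1/2}(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2))'\,\mathbf{S}_{\text{pooled}}^{-1}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)^{-1/2}(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2))$$

$$= \begin{pmatrix}\text{multivariate normal}\\ \text{random vector}\end{pmatrix}'\begin{pmatrix}\dfrac{\text{Wishart random matrix}}{\text{d.f.}}\end{pmatrix}^{-1}\begin{pmatrix}\text{multivariate normal}\\ \text{random vector}\end{pmatrix}$$

$$= N_p(\mathbf{0}, \boldsymbol{\Sigma})'\left[\frac{W_{n_1+n_2-2}(\boldsymbol{\Sigma})}{n_1 + n_2 - 2}\right]^{-1} N_p(\mathbf{0}, \boldsymbol{\Sigma})$$

which is the $T^2$-distribution specified in (5-8), with $n$ replaced by $n_1 + n_2 - 1$. [See
(5-5) for the relation to $F$.]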




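We are primarily interested in confidence regions for $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$. From (6-24), we
conclude that all $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$ within squared statistical distance $c^2$ of $\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2$ constitute
the confidence region. This region is an ellipsoid centered at the observed difference
$\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2$ and whose axes are determined by the eigenvalues and eigenvectors of
$\mathbf{S}_{\text{pooled}}$ (or $\mathbf{S}_{\text{pooled}}^{-1}$).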
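Example 6.3 (Constructing a confidence region for the difference of two mean vectors)
Fifty bars of soap are manufactured in each of two ways. Two characteristics,
$X_1$ = lather and $X_2$ = mildness, are measured. The summary statistics for bars
produced by methods 1 and 2 are

$$\bar{\mathbf{x}}_1 = \begin{bmatrix} 8.3 \\ 4.1 \end{bmatrix}, \quad \mathbf{S}_1 = \begin{bmatrix} 2 & 1 \\ 1 & 6 \end{bmatrix}$$

$$\bar{\mathbf{x}}_2 = \begin{bmatrix} 10.2 \\ 3.9 \end{bmatrix}, \quad \mathbf{S}_2 = \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix}$$

Obtain a 95% confidence region for $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$.
We first note that $\mathbf{S}_1$ and $\mathbf{S}_2$ are approximately equal, so that it is reasonable to
pool them. Hence, from (6-21),

$$\mathbf{S}_{\text{pooled}} = \frac{49}{98}\mathbf{S}_1 + \frac{49}{98}\mathbf{S}_2 = \begin{bmatrix} 2 & 1 \\ 1 & 5 \end{bmatrix}$$

Also,

$$\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2 = \begin{bmatrix} -1.9 \\ .2 \end{bmatrix}$$

so the confidence ellipse is centered at $[-1.9, .2]'$. The eigenvalues and eigenvectors
of $\mathbf{S}_{\text{pooled}}$ are obtained from the equation

$$0 = |\mathbf{S}_{\text{pooled}} - \lambda\mathbf{I}| = \begin{vmatrix} 2 - \lambda & 1 \\ 1 & 5 - \lambda \end{vmatrix} = \lambda^2 - 7\lambda + 9$$

so $\lambda = (7 \pm \sqrt{49 - 36})/2$. Consequently, $\lambda_1 = 5.303$ and $\lambda_2 = 1.697$, and the
corresponding eigenvectors, $\mathbf{e}_1$ and $\mathbf{e}_2$, determined from

$$\mathbf{S}_{\text{pooled}}\,\mathbf{e}_i = \lambda_i\mathbf{e}_i, \quad i = 1, 2$$

are

$$\mathbf{e}_1 = \begin{bmatrix} .290 \\ .957 \end{bmatrix} \quad\text{and}\quad \mathbf{e}_2 = \begin{bmatrix} .957 \\ -.290 \end{bmatrix}$$

By Result 6.2,

$$\left(\frac{1}{n_1} + \frac{1}{n_2}\right)c^2 = \left(\frac{1}{50} + \frac{1}{50}\right)\frac{(98)(2)}{(97)} F_{2,97}(.05) = .25$$

since $F_{2,97}(.05) = 3.1$. The confidence ellipse extends

$$\sqrt{\lambda_i}\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)c^2} = \sqrt{\lambda_i}\sqrt{.25}$$

units along the eigenvector $\mathbf{e}_i$, or 1.15 units in the $\mathbf{e}_1$ direction and .65 units in the $\mathbf{e}_2$
direction. The 95% confidence ellipse is shown in Figure 6.1. Clearly, $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \mathbf{0}$
is not in the ellipse, and we conclude that the two methods of manufacturing soap
produce different results. It appears as if the two processes produce bars of soap
with about the same mildness ($X_2$), but those from the second process have more
lather ($X_1$).

[Figure 6.1 95% confidence ellipse for $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$. Axes: $\mu_{11} - \mu_{21}$ (horizontal) and $\mu_{12} - \mu_{22}$ (vertical).]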
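The steps of Example 6.3 can be traced in a few lines of code. This is a minimal sketch under stated assumptions: the summary statistics are typed in from the example, $F_{2,97}(.05) = 3.1$ is the value quoted in the text, and the closed-form 2 x 2 eigen-decomposition and the names `unit_eigvec`, `half_lengths`, and `inside` are illustrative choices, not part of the original.

```python
import math

n1 = n2 = 50
p = 2
xbar1, xbar2 = [8.3, 4.1], [10.2, 3.9]
S1 = [[2.0, 1.0], [1.0, 6.0]]
S2 = [[2.0, 1.0], [1.0, 4.0]]
F_CRIT = 3.1  # F_{2,97}(.05), as quoted in the text

# Pooled covariance matrix, equation (6-21).
df = n1 + n2 - 2
Sp = [[((n1 - 1) * S1[i][j] + (n2 - 1) * S2[i][j]) / df for j in range(p)]
      for i in range(p)]
center = [a - b for a, b in zip(xbar1, xbar2)]

# Eigenvalues of the symmetric 2x2 matrix from lambda^2 - tr*lambda + det = 0.
tr = Sp[0][0] + Sp[1][1]
det = Sp[0][0] * Sp[1][1] - Sp[0][1] * Sp[1][0]
disc = math.sqrt(tr * tr - 4 * det)
lam = [(tr + disc) / 2, (tr - disc) / 2]


def unit_eigvec(l):
    """Unit eigenvector for eigenvalue l; assumes the off-diagonal entry is nonzero."""
    v = [Sp[0][1], l - Sp[0][0]]
    norm = math.hypot(v[0], v[1])
    return [v[0] / norm, v[1] / norm]


eigvecs = [unit_eigvec(l) for l in lam]

# Critical distance from Result 6.2 and the half-lengths of the ellipse axes.
c2 = df * p / (df - p + 1) * F_CRIT
scale = (1 / n1 + 1 / n2) * c2
half_lengths = [math.sqrt(l * scale) for l in lam]

# T^2 for H0: mu1 - mu2 = 0; the origin is inside the ellipse only if T^2 <= c^2.
k = 1 / n1 + 1 / n2
inv = [[Sp[1][1] / (k * det), -Sp[0][1] / (k * det)],
       [-Sp[1][0] / (k * det), Sp[0][0] / (k * det)]]
t2 = sum(center[i] * inv[i][j] * center[j] for i in range(p) for j in range(p))
inside = t2 <= c2

print("eigenvalues:", [round(l, 3) for l in lam])
print("half-lengths:", [round(h, 2) for h in half_lengths])
print("T2 =", round(t2, 2), "origin inside ellipse:", inside)
```

Because the origin falls outside the ellipse exactly when $T^2 > c^2$, the final check is equivalent to the test in (6-23) with $\boldsymbol{\delta}_0 = \mathbf{0}$.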


Simultaneous Confidence Intervals 


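It is possible to derive simultaneous confidence intervals for the components of the
vector $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$. These confidence intervals are developed from a consideration of
all possible linear combinations of the differences in the mean vectors. It is assumed
that the parent multivariate populations are normal with a common covariance $\boldsymbol{\Sigma}$.

Result 6.3. Let $c^2 = [(n_1 + n_2 - 2)p/(n_1 + n_2 - p - 1)]F_{p,\,n_1+n_2-p-1}(\alpha)$. With
probability $1 - \alpha$,

$$\mathbf{a}'(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2) \pm c\sqrt{\mathbf{a}'\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}_{\text{pooled}}\,\mathbf{a}}$$

will cover $\mathbf{a}'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ for all $\mathbf{a}$. In particular $\mu_{1i} - \mu_{2i}$ will be covered by

$$(\bar{X}_{1i} - \bar{X}_{2i}) \pm c\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)s_{ii,\text{pooled}}} \quad\text{for } i = 1, 2, \ldots, p$$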


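Proof. Consider univariate linear combinations of the observations

$$\mathbf{X}_{11}, \mathbf{X}_{12}, \ldots, \mathbf{X}_{1n_1} \quad\text{and}\quad \mathbf{X}_{21}, \mathbf{X}_{22}, \ldots, \mathbf{X}_{2n_2}$$

given by $\mathbf{a}'\mathbf{X}_{1j} = a_1X_{1j1} + a_2X_{1j2} + \cdots + a_pX_{1jp}$ and $\mathbf{a}'\mathbf{X}_{2j} = a_1X_{2j1} + a_2X_{2j2}
+ \cdots + a_pX_{2jp}$. These linear combinations have sample means and covariances
$\mathbf{a}'\bar{\mathbf{X}}_1$, $\mathbf{a}'\mathbf{S}_1\mathbf{a}$ and $\mathbf{a}'\bar{\mathbf{X}}_2$, $\mathbf{a}'\mathbf{S}_2\mathbf{a}$, respectively, where $\bar{\mathbf{X}}_1$, $\mathbf{S}_1$, and $\bar{\mathbf{X}}_2$, $\mathbf{S}_2$ are the mean
and covariance statistics for the two original samples. (See Result 3.5.) When both
parent populations have the same covariance matrix, $s^2_{1,\mathbf{a}} = \mathbf{a}'\mathbf{S}_1\mathbf{a}$ and $s^2_{2,\mathbf{a}} = \mathbf{a}'\mathbf{S}_2\mathbf{a}$
are both estimators of $\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$, the common population variance of the linear
combinations $\mathbf{a}'\mathbf{X}_1$ and $\mathbf{a}'\mathbf{X}_2$. Pooling these estimators, we obtain

$$s^2_{\mathbf{a},\text{pooled}} = \frac{(n_1 - 1)s^2_{1,\mathbf{a}} + (n_2 - 1)s^2_{2,\mathbf{a}}}{(n_1 + n_2 - 2)} = \mathbf{a}'\left[\frac{n_1 - 1}{n_1 + n_2 - 2}\mathbf{S}_1 + \frac{n_2 - 1}{n_1 + n_2 - 2}\mathbf{S}_2\right]\mathbf{a} = \mathbf{a}'\mathbf{S}_{\text{pooled}}\,\mathbf{a} \tag{6-25}$$

To test $H_0: \mathbf{a}'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = \mathbf{a}'\boldsymbol{\delta}_0$, on the basis of the $\mathbf{a}'\mathbf{X}_{1j}$ and $\mathbf{a}'\mathbf{X}_{2j}$, we can form
the square of the univariate two-sample $t$-statistic

$$t^2_{\mathbf{a}} = \frac{[\mathbf{a}'(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2) - \mathbf{a}'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]^2}{\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)s^2_{\mathbf{a},\text{pooled}}} = \frac{[\mathbf{a}'(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2))]^2}{\mathbf{a}'\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)\mathbf{S}_{\text{pooled}}\,\mathbf{a}} \tag{6-26}$$

According to the maximization lemma with $\mathbf{d} = (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2))$ and
$\mathbf{B} = (1/n_1 + 1/n_2)\mathbf{S}_{\text{pooled}}$ in (2-50),

$$t^2_{\mathbf{a}} \le (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2))'\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}_{\text{pooled}}\right]^{-1}(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)) = T^2$$

for all $\mathbf{a} \ne \mathbf{0}$. Thus,

$$(1 - \alpha) = P[T^2 \le c^2] = P[t^2_{\mathbf{a}} \le c^2, \text{ for all } \mathbf{a}]$$

$$= P\left[\,\left|\mathbf{a}'(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2) - \mathbf{a}'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\right| \le c\sqrt{\mathbf{a}'\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}_{\text{pooled}}\,\mathbf{a}} \ \text{ for all } \mathbf{a}\right]$$

where $c^2$ is selected according to Result 6.2.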


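Remark. For testing $H_0: \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \mathbf{0}$, the linear combination $\hat{\mathbf{a}}'(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$, with
coefficient vector $\hat{\mathbf{a}} \propto \mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$, quantifies the largest population difference.
That is, if $T^2$ rejects $H_0$, then $\hat{\mathbf{a}}'(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)$ will have a nonzero mean. Frequently, we
try to interpret the components of this linear combination for both subject matter
and statistical importance.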


Example 6.4 (Calculating simultaneous confidence intervals for the differences in
mean components) Samples of sizes $n_1 = 45$ and $n_2 = 55$ were taken of Wisconsin
homeowners with and without air conditioning, respectively. (Data courtesy of
Statistical Laboratory, University of Wisconsin.) Two measurements of electrical usage
(in kilowatt hours) were considered. The first is a measure of total on-peak
consumption ($X_1$) during July, and the second is a measure of total off-peak consumption
($X_2$) during July. The resulting summary statistics are

$$\bar{\mathbf{x}}_1 = \begin{bmatrix} 204.4 \\ 556.6 \end{bmatrix}, \quad \mathbf{S}_1 = \begin{bmatrix} 13825.3 & 23823.4 \\ 23823.4 & 73107.4 \end{bmatrix}, \quad n_1 = 45$$

$$\bar{\mathbf{x}}_2 = \begin{bmatrix} 130.0 \\ 355.0 \end{bmatrix}, \quad \mathbf{S}_2 = \begin{bmatrix} 8632.0 & 19616.7 \\ 19616.7 & 55964.5 \end{bmatrix}, \quad n_2 = 55$$




(The off-peak consumption is higher than the on-peak consumption because there
are more off-peak hours in a month.)

Let us find 95% simultaneous confidence intervals for the differences in the
mean components.

Although there appears to be somewhat of a discrepancy in the sample
variances, for illustrative purposes we proceed to a calculation of the pooled sample
covariance matrix. Here

$$\mathbf{S}_{\text{pooled}} = \frac{n_1 - 1}{n_1 + n_2 - 2}\mathbf{S}_1 + \frac{n_2 - 1}{n_1 + n_2 - 2}\mathbf{S}_2 = \begin{bmatrix} 10963.7 & 21505.5 \\ 21505.5 & 63661.3 \end{bmatrix}$$

and

$$c^2 = \frac{(n_1 + n_2 - 2)p}{n_1 + n_2 - p - 1} F_{p,\,n_1+n_2-p-1}(\alpha) = \frac{98(2)}{97} F_{2,97}(.05) = (2.02)(3.1) = 6.26$$

With $\boldsymbol{\mu}_1' - \boldsymbol{\mu}_2' = [\mu_{11} - \mu_{21},\ \mu_{12} - \mu_{22}]$, the 95% simultaneous confidence
intervals for the population differences are

$$\mu_{11} - \mu_{21}: \quad (204.4 - 130.0) \pm \sqrt{6.26}\,\sqrt{\left(\frac{1}{45} + \frac{1}{55}\right)10963.7}$$

or

$$21.7 \le \mu_{11} - \mu_{21} \le 127.1 \quad \text{(on-peak)}$$

$$\mu_{12} - \mu_{22}: \quad (556.6 - 355.0) \pm \sqrt{6.26}\,\sqrt{\left(\frac{1}{45} + \frac{1}{55}\right)63661.3}$$

or

$$74.7 \le \mu_{12} - \mu_{22} \le 328.5 \quad \text{(off-peak)}$$

We conclude that there is a difference in electrical consumption between those with
air-conditioning and those without. This difference is evident in both on-peak and
off-peak consumption.
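The 95% confidence ellipse for $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$ is determined from the
eigenvalue-eigenvector pairs $\lambda_1 = 71323.5$, $\mathbf{e}_1' = [.336, .942]$ and $\lambda_2 = 3301.5$, $\mathbf{e}_2' = [.942, -.336]$.
Since

$$\sqrt{\lambda_1}\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)c^2} = \sqrt{71323.5}\,\sqrt{\left(\frac{1}{45} + \frac{1}{55}\right)6.26} = 134.3$$

and

$$\sqrt{\lambda_2}\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)c^2} = \sqrt{3301.5}\,\sqrt{\left(\frac{1}{45} + \frac{1}{55}\right)6.26} = 28.9$$

we obtain the 95% confidence ellipse for $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$ sketched in Figure 6.2.
Because the confidence ellipse for the difference in means does not cover $\mathbf{0}' = [0, 0]$,
the $T^2$-statistic will reject $H_0: \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \mathbf{0}$ at the 5% level.

[Figure 6.2 95% confidence ellipse for $\boldsymbol{\mu}_1' - \boldsymbol{\mu}_2' = (\mu_{11} - \mu_{21},\ \mu_{12} - \mu_{22})$. Axes: $\mu_{11} - \mu_{21}$ (horizontal) and $\mu_{12} - \mu_{22}$ (vertical).]

The coefficient vector for the linear combination most responsible for rejection
is proportional to $\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$. (See Exercise 6.7.)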
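The simultaneous intervals of Example 6.4 follow directly from Result 6.3. The sketch below is illustrative only: it hard-codes the summary statistics from the example, uses $F_{2,97}(.05) = 3.1$ as quoted in the text, and the function name `simultaneous_interval` is an assumption introduced here.

```python
import math

n1, n2, p = 45, 55, 2
xbar1, xbar2 = [204.4, 556.6], [130.0, 355.0]
S1 = [[13825.3, 23823.4], [23823.4, 73107.4]]
S2 = [[8632.0, 19616.7], [19616.7, 55964.5]]
F_CRIT = 3.1  # F_{2,97}(.05), as quoted in the text

# Pooled covariance (6-21) and critical distance c^2 from Result 6.2.
df = n1 + n2 - 2
Sp = [[((n1 - 1) * S1[i][j] + (n2 - 1) * S2[i][j]) / df for j in range(p)]
      for i in range(p)]
c2 = df * p / (df - p + 1) * F_CRIT


def simultaneous_interval(a):
    """Result 6.3 interval for a'(mu1 - mu2) using the pooled covariance."""
    estimate = sum(ai * (x1 - x2) for ai, x1, x2 in zip(a, xbar1, xbar2))
    quad = sum(a[i] * Sp[i][j] * a[j] for i in range(p) for j in range(p))
    half = math.sqrt(c2 * (1 / n1 + 1 / n2) * quad)
    return estimate - half, estimate + half


on_peak = simultaneous_interval([1, 0])
off_peak = simultaneous_interval([0, 1])
print("on-peak:  (%.1f, %.1f)" % on_peak)
print("off-peak: (%.1f, %.1f)" % off_peak)
```

Any other coefficient vector, for example `[1, -1]`, can be passed to the same function without changing the overall confidence level, because the intervals hold simultaneously for all $\mathbf{a}$.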


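The Bonferroni $100(1 - \alpha)\%$ simultaneous confidence intervals for the $p$
population mean differences are

$$\mu_{1i} - \mu_{2i}: \quad (\bar{x}_{1i} - \bar{x}_{2i}) \pm t_{n_1+n_2-2}\left(\frac{\alpha}{2p}\right)\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)s_{ii,\text{pooled}}}$$

where $t_{n_1+n_2-2}(\alpha/2p)$ is the upper $100(\alpha/2p)$th percentile of a $t$-distribution with
$n_1 + n_2 - 2$ d.f.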


The Two-Sample Situation When $\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2$


When $\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2$, we are unable to find a "distance" measure like $T^2$, whose
distribution does not depend on the unknowns $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$. Bartlett's test [3] is used to test
the equality of $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ in terms of generalized variances. Unfortunately, the
conclusions can be seriously misleading when the populations are nonnormal.
Nonnormality and unequal covariances cannot be separated with Bartlett's test. (See also
Section 6.6.) A method of testing the equality of two covariance matrices that is less
sensitive to the assumption of multivariate normality has been proposed by Tiku
and Balakrishnan [23]. However, more practical experience is needed with this test
before we can recommend it unconditionally.

We suggest, without much factual support, that any discrepancy of the order
$\sigma_{1,ii} = 4\sigma_{2,ii}$, or vice versa, is probably serious. This is true in the univariate case.
The size of the discrepancies that are critical in the multivariate situation probably
depends, to a large extent, on the number of variables $p$.

A transformation may improve things when the marginal variances are quite
different. However, for $n_1$ and $n_2$ large, we can avoid the complexities due to
unequal covariance matrices.




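Result 6.4. Let the sample sizes be such that $n_1 - p$ and $n_2 - p$ are large. Then, an
approximate $100(1 - \alpha)\%$ confidence ellipsoid for $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$ is given by all $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$
satisfying

$$[\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]'\left[\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right]^{-1}[\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)] \le \chi^2_p(\alpha)$$

where $\chi^2_p(\alpha)$ is the upper $(100\alpha)$th percentile of a chi-square distribution with $p$ d.f.
Also, $100(1 - \alpha)\%$ simultaneous confidence intervals for all linear combinations
$\mathbf{a}'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ are provided by

$$\mathbf{a}'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \quad\text{belongs to}\quad \mathbf{a}'(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) \pm \sqrt{\chi^2_p(\alpha)}\,\sqrt{\mathbf{a}'\left(\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right)\mathbf{a}}$$

Proof. From (6-22) and (3-9),

$$E(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2) = \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$$

and

$$\text{Cov}(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2) = \text{Cov}(\bar{\mathbf{X}}_1) + \text{Cov}(\bar{\mathbf{X}}_2) = \frac{1}{n_1}\boldsymbol{\Sigma}_1 + \frac{1}{n_2}\boldsymbol{\Sigma}_2$$

By the central limit theorem, $\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2$ is nearly $N_p[\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2,\ n_1^{-1}\boldsymbol{\Sigma}_1 + n_2^{-1}\boldsymbol{\Sigma}_2]$. If $\boldsymbol{\Sigma}_1$
and $\boldsymbol{\Sigma}_2$ were known, the square of the statistical distance from $\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2$ to $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$
would be

$$[\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]'\left(\frac{1}{n_1}\boldsymbol{\Sigma}_1 + \frac{1}{n_2}\boldsymbol{\Sigma}_2\right)^{-1}[\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)]$$

This squared distance has an approximate $\chi^2_p$-distribution, by Result 4.7. When $n_1$ and
$n_2$ are large, with high probability, $\mathbf{S}_1$ will be close to $\boldsymbol{\Sigma}_1$ and $\mathbf{S}_2$ will be close to $\boldsymbol{\Sigma}_2$.
Consequently, the approximation holds with $\mathbf{S}_1$ and $\mathbf{S}_2$ in place of $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$,
respectively.

The results concerning the simultaneous confidence intervals follow from
Result 5A.1.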


Remark. If $n_1 = n_2 = n$, then $(n - 1)/(n + n - 2) = 1/2$, so

$$\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2 = \frac{1}{n}(\mathbf{S}_1 + \mathbf{S}_2) = \frac{(n - 1)\mathbf{S}_1 + (n - 1)\mathbf{S}_2}{n + n - 2}\left(\frac{1}{n} + \frac{1}{n}\right) = \mathbf{S}_{\text{pooled}}\left(\frac{1}{n} + \frac{1}{n}\right)$$

With equal sample sizes, the large sample procedure is essentially the same as the
procedure based on the pooled covariance matrix. (See Result 6.2.) In one
dimension, it is well known that the effect of unequal variances is least when $n_1 = n_2$ and
greatest when $n_1$ is much less than $n_2$ or vice versa.



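Example 6.5 (Large sample procedures for inferences about the difference in means)
We shall analyze the electrical-consumption data discussed in Example 6.4 using the
large sample approach. We first calculate

$$\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2 = \frac{1}{45}\begin{bmatrix} 13825.3 & 23823.4 \\ 23823.4 & 73107.4 \end{bmatrix} + \frac{1}{55}\begin{bmatrix} 8632.0 & 19616.7 \\ 19616.7 & 55964.5 \end{bmatrix} = \begin{bmatrix} 464.17 & 886.08 \\ 886.08 & 2642.15 \end{bmatrix}$$

The 95% simultaneous confidence intervals for the linear combinations

$$\mathbf{a}'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = [1, 0]\begin{bmatrix} \mu_{11} - \mu_{21} \\ \mu_{12} - \mu_{22} \end{bmatrix} = \mu_{11} - \mu_{21}$$

and

$$\mathbf{a}'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = [0, 1]\begin{bmatrix} \mu_{11} - \mu_{21} \\ \mu_{12} - \mu_{22} \end{bmatrix} = \mu_{12} - \mu_{22}$$

are (see Result 6.4)

$\mu_{11} - \mu_{21}$:  $74.4 \pm \sqrt{5.99}\,\sqrt{464.17}$   or   (21.7, 127.1)
$\mu_{12} - \mu_{22}$:  $201.6 \pm \sqrt{5.99}\,\sqrt{2642.15}$   or   (75.8, 327.4)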


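Notice that these intervals differ negligibly from the intervals in Example 6.4, where
the pooling procedure was employed. The $T^2$-statistic for testing $H_0: \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \mathbf{0}$ is

$$T^2 = [\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2]'\left[\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right]^{-1}[\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2]$$

$$= \begin{bmatrix} 204.4 - 130.0 \\ 556.6 - 355.0 \end{bmatrix}'\begin{bmatrix} 464.17 & 886.08 \\ 886.08 & 2642.15 \end{bmatrix}^{-1}\begin{bmatrix} 204.4 - 130.0 \\ 556.6 - 355.0 \end{bmatrix}$$

$$= [74.4 \quad 201.6]\,(10^{-4})\begin{bmatrix} 59.874 & -20.080 \\ -20.080 & 10.519 \end{bmatrix}\begin{bmatrix} 74.4 \\ 201.6 \end{bmatrix} = 15.66$$

For $\alpha = .05$, the critical value is $\chi^2_2(.05) = 5.99$ and, since $T^2 = 15.66 > \chi^2_2(.05)
= 5.99$, we reject $H_0$.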

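The most critical linear combination leading to the rejection of $H_0$ has
coefficient vector

$$\hat{\mathbf{a}} \propto \left(\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right)^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) = (10^{-4})\begin{bmatrix} 59.874 & -20.080 \\ -20.080 & 10.519 \end{bmatrix}\begin{bmatrix} 74.4 \\ 201.6 \end{bmatrix} = \begin{bmatrix} .041 \\ .063 \end{bmatrix}$$

The difference in off-peak electrical consumption between those with air
conditioning and those without contributes more than the corresponding difference in
on-peak consumption to the rejection of $H_0: \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \mathbf{0}$.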
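The large sample calculations of Example 6.5 can be reproduced as follows. This is a sketch under stated assumptions rather than a reference implementation: the chi-square critical value $\chi^2_2(.05) = 5.99$ is taken from the text, and the helper `inv2` (an explicit 2 x 2 inverse) and the other names are introduced only for illustration.

```python
import math

n1, n2, p = 45, 55, 2
xbar1, xbar2 = [204.4, 556.6], [130.0, 355.0]
S1 = [[13825.3, 23823.4], [23823.4, 73107.4]]
S2 = [[8632.0, 19616.7], [19616.7, 55964.5]]
CHI2_CRIT = 5.99  # chi-square_2(.05), as quoted in the text


def inv2(m):
    """Inverse of a 2x2 matrix via the adjugate formula (hypothetical helper)."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


# Unpooled estimate of Cov(Xbar1 - Xbar2): (1/n1) S1 + (1/n2) S2.
V = [[S1[i][j] / n1 + S2[i][j] / n2 for j in range(p)] for i in range(p)]
V_inv = inv2(V)
d = [a - b for a, b in zip(xbar1, xbar2)]

# Coefficient vector of the most critical linear combination, and T^2 = d' V^{-1} d.
a_hat = [sum(V_inv[i][j] * d[j] for j in range(p)) for i in range(p)]
t2 = sum(d[i] * a_hat[i] for i in range(p))
reject = t2 > CHI2_CRIT

# Simultaneous intervals for the two component differences (Result 6.4).
intervals = [(d[i] - math.sqrt(CHI2_CRIT * V[i][i]),
              d[i] + math.sqrt(CHI2_CRIT * V[i][i])) for i in range(p)]

print("T2 = %.2f, reject H0: %s" % (t2, reject))
print("a_hat =", [round(a, 3) for a in a_hat])
print("intervals:", [(round(lo, 1), round(hi, 1)) for lo, hi in intervals])
```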




A statistic similar to $T^2$ that is less sensitive to outlying observations for small
and moderately sized samples has been developed by Tiku and Singh [24]. However,
if the sample size is moderate to large, Hotelling's $T^2$ is remarkably unaffected by
slight departures from normality and/or the presence of a few outliers.


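An Approximation to the Distribution of $T^2$ for Normal
Populations When Sample Sizes Are Not Large


One can test $H_0: \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \mathbf{0}$ when the population covariance matrices are
unequal even if the two sample sizes are not large, provided the two populations are
multivariate normal. This situation is often called the multivariate Behrens-Fisher
problem. The result requires that both sample sizes $n_1$ and $n_2$ are greater than $p$, the
number of variables. The approach depends on an approximation to the distribution
of the statistic

$$T^2 = (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2))'\left[\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right]^{-1}(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)) \tag{6-27}$$

which is identical to the large sample statistic in Result 6.4. However, instead of
using the chi-square approximation to obtain the critical value for testing $H_0$, the
recommended approximation for smaller samples (see [15] and [19]) is given by

$$T^2 = \frac{\nu p}{\nu - p + 1} F_{p,\,\nu-p+1} \tag{6-28}$$

where the degrees of freedom $\nu$ are estimated from the sample covariance matrices
using the relation

$$\nu = \frac{p + p^2}{\displaystyle\sum_{i=1}^{2}\frac{1}{n_i}\left\{\text{tr}\left[\left(\frac{1}{n_i}\mathbf{S}_i\left(\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right)^{-1}\right)^2\right] + \left(\text{tr}\left[\frac{1}{n_i}\mathbf{S}_i\left(\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right)^{-1}\right]\right)^2\right\}} \tag{6-29}$$

where $\min(n_1, n_2) \le \nu \le n_1 + n_2$. This approximation reduces to the usual Welch
solution to the Behrens-Fisher problem in the univariate ($p = 1$) case.

With moderate sample sizes and two normal populations, the approximate level
$\alpha$ test for equality of means rejects $H_0: \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \mathbf{0}$ if

$$(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2))'\left[\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right]^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)) > \frac{\nu p}{\nu - p + 1} F_{p,\,\nu-p+1}(\alpha)$$

where the degrees of freedom $\nu$ are given by (6-29). This procedure is consistent
with the large samples procedure in Result 6.4 except that the critical value $\chi^2_p(\alpha)$ is
replaced by the larger constant $\dfrac{\nu p}{\nu - p + 1} F_{p,\,\nu-p+1}(\alpha)$.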


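Similarly, the approximate $100(1 - \alpha)\%$ confidence region is given by all
$\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$ such that

$$(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2))'\left[\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right]^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)) \le \frac{\nu p}{\nu - p + 1} F_{p,\,\nu-p+1}(\alpha) \tag{6-30}$$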




For normal populations, the approximation to the distribution of $T^2$ given by
(6-28) and (6-29) usually gives reasonable results.


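Example 6.6 (The approximate $T^2$ distribution when $\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2$) Although the sample
sizes are rather large for the electrical consumption data in Example 6.4, we use
these data and the calculations in Example 6.5 to illustrate the computations leading
to the approximate distribution of $T^2$ when the population covariance matrices are
unequal.

We first calculate

$$\frac{1}{n_1}\mathbf{S}_1 = \frac{1}{45}\begin{bmatrix} 13825.3 & 23823.4 \\ 23823.4 & 73107.4 \end{bmatrix} = \begin{bmatrix} 307.229 & 529.409 \\ 529.409 & 1624.609 \end{bmatrix}$$

$$\frac{1}{n_2}\mathbf{S}_2 = \frac{1}{55}\begin{bmatrix} 8632.0 & 19616.7 \\ 19616.7 & 55964.5 \end{bmatrix} = \begin{bmatrix} 156.945 & 356.667 \\ 356.667 & 1017.536 \end{bmatrix}$$

and using a result from Example 6.5,

$$\left[\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right]^{-1} = (10^{-4})\begin{bmatrix} 59.874 & -20.080 \\ -20.080 & 10.519 \end{bmatrix}$$

Consequently,

$$\frac{1}{n_1}\mathbf{S}_1\left[\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right]^{-1} = \begin{bmatrix} 307.229 & 529.409 \\ 529.409 & 1624.609 \end{bmatrix}(10^{-4})\begin{bmatrix} 59.874 & -20.080 \\ -20.080 & 10.519 \end{bmatrix} = \begin{bmatrix} .776 & -.060 \\ -.092 & .646 \end{bmatrix}$$

and

$$\left(\frac{1}{n_1}\mathbf{S}_1\left[\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right]^{-1}\right)^2 = \begin{bmatrix} .776 & -.060 \\ -.092 & .646 \end{bmatrix}\begin{bmatrix} .776 & -.060 \\ -.092 & .646 \end{bmatrix} = \begin{bmatrix} .608 & -.085 \\ -.131 & .423 \end{bmatrix}$$

Further,

$$\frac{1}{n_2}\mathbf{S}_2\left[\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right]^{-1} = \begin{bmatrix} 156.945 & 356.667 \\ 356.667 & 1017.536 \end{bmatrix}(10^{-4})\begin{bmatrix} 59.874 & -20.080 \\ -20.080 & 10.519 \end{bmatrix} = \begin{bmatrix} .224 & .060 \\ .092 & .354 \end{bmatrix}$$

and

$$\left(\frac{1}{n_2}\mathbf{S}_2\left[\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right]^{-1}\right)^2 = \begin{bmatrix} .224 & .060 \\ .092 & .354 \end{bmatrix}\begin{bmatrix} .224 & .060 \\ .092 & .354 \end{bmatrix} = \begin{bmatrix} .055 & .035 \\ .053 & .131 \end{bmatrix}$$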




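Then

$$\frac{1}{n_1}\left\{\text{tr}\left[\left(\frac{1}{n_1}\mathbf{S}_1\left(\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right)^{-1}\right)^2\right] + \left(\text{tr}\left[\frac{1}{n_1}\mathbf{S}_1\left(\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right)^{-1}\right]\right)^2\right\} = \frac{1}{45}\{(.608 + .423) + (.776 + .646)^2\} = .0678$$

$$\frac{1}{n_2}\left\{\text{tr}\left[\left(\frac{1}{n_2}\mathbf{S}_2\left(\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right)^{-1}\right)^2\right] + \left(\text{tr}\left[\frac{1}{n_2}\mathbf{S}_2\left(\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right)^{-1}\right]\right)^2\right\} = \frac{1}{55}\{(.055 + .131) + (.224 + .354)^2\} = .0095$$

Using (6-29), the estimated degrees of freedom $\nu$ is

$$\nu = \frac{2 + 2^2}{.0678 + .0095} = 77.6$$

and the $\alpha = .05$ critical value is

$$\frac{\nu p}{\nu - p + 1} F_{p,\,\nu-p+1}(.05) = \frac{77.6 \times 2}{77.6 - 2 + 1} F_{2,\,77.6-2+1}(.05) = \frac{155.2}{76.6}\,3.12 = 6.32$$

From Example 6.5, the observed value of the test statistic is $T^2 = 15.66$ so the
hypothesis $H_0: \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \mathbf{0}$ is rejected at the 5% level. This is the same conclusion
reached with the large sample procedure described in Example 6.5.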


As was the case in Example 6.6, the $F_{p,\,\nu-p+1}$ distribution can be defined with
noninteger degrees of freedom. A slightly more conservative approach is to use the
integer part of $\nu$.
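The degrees-of-freedom estimate (6-29) is tedious by hand, so a short script may help. The following sketch, written under stated assumptions, recomputes $\nu$ for the electrical consumption data; the helpers `inv2`, `matmul`, and `trace` are hypothetical names introduced here, and $F_{2,76.6}(.05) = 3.12$ is the value quoted in Example 6.6 rather than something the code computes.

```python
n = [45, 55]
p = 2
S = [
    [[13825.3, 23823.4], [23823.4, 73107.4]],   # S1
    [[8632.0, 19616.7], [19616.7, 55964.5]],    # S2
]
F_CRIT = 3.12  # F_{2,76.6}(.05), as quoted in the text


def inv2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def trace(m):
    return sum(m[i][i] for i in range(len(m)))


# V = (1/n1) S1 + (1/n2) S2 and its inverse.
V = [[S[0][i][j] / n[0] + S[1][i][j] / n[1] for j in range(p)] for i in range(p)]
V_inv = inv2(V)

# One term of the denominator of (6-29) per sample.
terms = []
for n_i, S_i in zip(n, S):
    A = matmul([[s / n_i for s in row] for row in S_i], V_inv)
    terms.append((trace(matmul(A, A)) + trace(A) ** 2) / n_i)

nu = (p + p * p) / sum(terms)
crit = nu * p / (nu - p + 1) * F_CRIT

print("terms:", [round(t, 4) for t in terms])
print("nu = %.1f, critical value = %.2f" % (nu, crit))
```

Replacing `nu` by `int(nu)` before computing `crit` corresponds to the slightly more conservative integer-part approach mentioned above, although the tabled $F$ value would then also need to be looked up for the new degrees of freedom.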


6.4 Comparing Several Multivariate Population Means 
(One-Way MANOVA) 


Often, more than two populations need to be compared. Random samples, collected
from each of $g$ populations, are arranged as

Population 1:  $\mathbf{X}_{11}, \mathbf{X}_{12}, \ldots, \mathbf{X}_{1n_1}$
Population 2:  $\mathbf{X}_{21}, \mathbf{X}_{22}, \ldots, \mathbf{X}_{2n_2}$
     ⋮                                                   (6-31)
Population $g$:  $\mathbf{X}_{g1}, \mathbf{X}_{g2}, \ldots, \mathbf{X}_{gn_g}$

MANOVA is used first to investigate whether the population mean vectors are the
same and, if not, which mean components differ significantly.


Assumptions about the Structure of the Data for One-Way MANOVA


1. $\mathbf{X}_{\ell 1}, \mathbf{X}_{\ell 2}, \ldots, \mathbf{X}_{\ell n_\ell}$ is a random sample of size $n_\ell$ from a population with mean $\boldsymbol{\mu}_\ell$,
$\ell = 1, 2, \ldots, g$. The random samples from different populations are independent.

2. All populations have a common covariance matrix $\boldsymbol{\Sigma}$.

3. Each population is multivariate normal.

Condition 3 can be relaxed by appealing to the central limit theorem (Result 4.13)
when the sample sizes $n_\ell$ are large.

A review of the univariate analysis of variance (ANOVA) will facilitate our 
discussion of the multivariate assumptions and solution methods. 


A Summary of Univariate ANOVA 


In the univariate situation, the assumptions are that $X_{\ell 1}, X_{\ell 2}, \ldots, X_{\ell n_\ell}$ is a random
sample from an $N(\mu_\ell, \sigma^2)$ population, $\ell = 1, 2, \ldots, g$, and that the random samples
are independent. Although the null hypothesis of equality of means could be
formulated as $\mu_1 = \mu_2 = \cdots = \mu_g$, it is customary to regard $\mu_\ell$ as the sum of an overall
mean component, such as $\mu$, and a component due to the specific population. For
instance, we can write $\mu_\ell = \mu + (\mu_\ell - \mu)$ or $\mu_\ell = \mu + \tau_\ell$ where $\tau_\ell = \mu_\ell - \mu$.

Populations usually correspond to different sets of experimental conditions, and
therefore, it is convenient to investigate the deviations $\tau_\ell$ associated with the $\ell$th
population (treatment).

The reparameterization

$$\mu_\ell = \mu + \tau_\ell \tag{6-32}$$

($\ell$th population mean) = (overall mean) + ($\ell$th population (treatment) effect)

leads to a restatement of the hypothesis of equality of means. The null hypothesis
becomes

$$H_0: \tau_1 = \tau_2 = \cdots = \tau_g = 0$$

The response $X_{\ell j}$, distributed as $N(\mu + \tau_\ell, \sigma^2)$, can be expressed in the suggestive
form

$$X_{\ell j} = \mu + \tau_\ell + e_{\ell j} \tag{6-33}$$

(response) = (overall mean) + (treatment effect) + (random error)

where the $e_{\ell j}$ are independent $N(0, \sigma^2)$ random variables. To define uniquely
the model parameters and their least squares estimates, it is customary to impose the
constraint $\sum_{\ell=1}^{g} n_\ell\tau_\ell = 0$.


Motivated by the decomposition in (6-33), the analysis of variance is based
upon an analogous decomposition of the observations,

$$x_{\ell j} = \bar{x} + (\bar{x}_\ell - \bar{x}) + (x_{\ell j} - \bar{x}_\ell) \tag{6-34}$$

(observation) = (overall sample mean) + (estimated treatment effect) + (residual)

where $\bar{x}$ is an estimate of $\mu$, $\hat{\tau}_\ell = (\bar{x}_\ell - \bar{x})$ is an estimate of $\tau_\ell$, and $(x_{\ell j} - \bar{x}_\ell)$ is an
estimate of the error $e_{\ell j}$.




Example 6.7 (The sum of squares decomposition for univariate ANOVA) Consider
the following independent samples.
Population 1: 9, 6, 9
Population 2: 0, 2
Population 3: 3, 1, 2
Since, for example, $\bar{x}_3 = (3 + 1 + 2)/3 = 2$ and $\bar{x} = (9 + 6 + 9 + 0 + 2 +
3 + 1 + 2)/8 = 4$, we find that

$$3 = x_{31} = \bar{x} + (\bar{x}_3 - \bar{x}) + (x_{31} - \bar{x}_3) = 4 + (2 - 4) + (3 - 2) = 4 + (-2) + 1$$

Repeating this operation for each observation, we obtain the arrays

$$\begin{pmatrix} 9 & 6 & 9 \\ 0 & 2 & \\ 3 & 1 & 2 \end{pmatrix} = \begin{pmatrix} 4 & 4 & 4 \\ 4 & 4 & \\ 4 & 4 & 4 \end{pmatrix} + \begin{pmatrix} 4 & 4 & 4 \\ -3 & -3 & \\ -2 & -2 & -2 \end{pmatrix} + \begin{pmatrix} 1 & -2 & 1 \\ -1 & 1 & \\ 1 & -1 & 0 \end{pmatrix}$$

observation $(x_{\ell j})$ = mean $(\bar{x})$ + treatment effect $(\bar{x}_\ell - \bar{x})$ + residual $(x_{\ell j} - \bar{x}_\ell)$

The question of equality of means is answered by assessing whether the
contribution of the treatment array is large relative to the residuals. (Our
estimates $\hat{\tau}_\ell = \bar{x}_\ell - \bar{x}$ of $\tau_\ell$ always satisfy $\sum_{\ell=1}^{g} n_\ell\hat{\tau}_\ell = 0$. Under $H_0$, each $\hat{\tau}_\ell$ is an
estimate of zero.) If the treatment contribution is large, $H_0$ should be rejected. The
size of an array is quantified by stringing the rows of the array out into a vector and
calculating its squared length. This quantity is called the sum of squares (SS). For
the observations, we construct the vector $\mathbf{y}' = [9, 6, 9, 0, 2, 3, 1, 2]$. Its squared
length is

$$SS_{\text{obs}} = 9^2 + 6^2 + 9^2 + 0^2 + 2^2 + 3^2 + 1^2 + 2^2 = 216$$

Similarly,

$$SS_{\text{mean}} = 4^2 + 4^2 + 4^2 + 4^2 + 4^2 + 4^2 + 4^2 + 4^2 = 8(4^2) = 128$$

$$SS_{\text{tr}} = 4^2 + 4^2 + 4^2 + (-3)^2 + (-3)^2 + (-2)^2 + (-2)^2 + (-2)^2 = 3(4^2) + 2(-3)^2 + 3(-2)^2 = 78$$

and the residual sum of squares is

$$SS_{\text{res}} = 1^2 + (-2)^2 + 1^2 + (-1)^2 + 1^2 + 1^2 + (-1)^2 + 0^2 = 10$$

The sums of squares satisfy the same decomposition, (6-34), as the observations.
Consequently,

$$SS_{\text{obs}} = SS_{\text{mean}} + SS_{\text{tr}} + SS_{\text{res}}$$

or $216 = 128 + 78 + 10$. The breakup into sums of squares apportions variability in
the combined samples into mean, treatment, and residual (error) components. An
analysis of variance proceeds by comparing the relative sizes of $SS_{\text{tr}}$ and $SS_{\text{res}}$. If $H_0$
is true, variances computed from $SS_{\text{tr}}$ and $SS_{\text{res}}$ should be approximately equal.
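The decomposition in Example 6.7 is easy to verify numerically. The sketch below is illustrative only; the variable names are assumptions, and the data are the three small samples given in the example.

```python
# The three independent samples of Example 6.7.
samples = [[9, 6, 9], [0, 2], [3, 1, 2]]

n = sum(len(s) for s in samples)
grand_mean = sum(sum(s) for s in samples) / n
sample_means = [sum(s) / len(s) for s in samples]

# Squared lengths of the observation, mean, treatment, and residual arrays.
ss_obs = sum(x * x for s in samples for x in s)
ss_mean = n * grand_mean ** 2
ss_tr = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, sample_means))
ss_res = sum((x - m) ** 2 for s, m in zip(samples, sample_means) for x in s)

print(ss_obs, ss_mean, ss_tr, ss_res)
print("decomposition holds:", ss_obs == ss_mean + ss_tr + ss_res)
```

The equality check is exact here only because every mean in this example is an integer; with general data a small tolerance would be the safer comparison.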




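The sum of squares decomposition illustrated numerically in Example 6.7 is so
basic that the algebraic equivalent will now be developed.
Subtracting $\bar{x}$ from both sides of (6-34) and squaring gives

$$(x_{\ell j} - \bar{x})^2 = (\bar{x}_\ell - \bar{x})^2 + (x_{\ell j} - \bar{x}_\ell)^2 + 2(\bar{x}_\ell - \bar{x})(x_{\ell j} - \bar{x}_\ell)$$

We can sum both sides over $j$, note that $\sum_{j=1}^{n_\ell}(x_{\ell j} - \bar{x}_\ell) = 0$, and obtain

$$\sum_{j=1}^{n_\ell}(x_{\ell j} - \bar{x})^2 = n_\ell(\bar{x}_\ell - \bar{x})^2 + \sum_{j=1}^{n_\ell}(x_{\ell j} - \bar{x}_\ell)^2$$

Next, summing both sides over $\ell$ we get

$$\sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell}(x_{\ell j} - \bar{x})^2 = \sum_{\ell=1}^{g} n_\ell(\bar{x}_\ell - \bar{x})^2 + \sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell}(x_{\ell j} - \bar{x}_\ell)^2 \tag{6-35}$$

($SS_{\text{cor}}$, total (corrected) SS) = ($SS_{\text{tr}}$, between (samples) SS) + ($SS_{\text{res}}$, within (samples) SS)

or

$$\sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell} x_{\ell j}^2 = (n_1 + n_2 + \cdots + n_g)\bar{x}^2 + \sum_{\ell=1}^{g} n_\ell(\bar{x}_\ell - \bar{x})^2 + \sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell}(x_{\ell j} - \bar{x}_\ell)^2 \tag{6-36}$$

($SS_{\text{obs}}$) = ($SS_{\text{mean}}$) + ($SS_{\text{tr}}$) + ($SS_{\text{res}}$)

In the course of establishing (6-36), we have verified that the arrays
representing the mean, treatment effects, and residuals are orthogonal. That is, these arrays,
considered as vectors, are perpendicular whatever the observation vector
$\mathbf{y}' = [x_{11}, \ldots, x_{1n_1}, x_{21}, \ldots, x_{2n_2}, \ldots, x_{gn_g}]$. Consequently, we could obtain $SS_{\text{res}}$ by
subtraction, without having to calculate the individual residuals, because $SS_{\text{res}} =
SS_{\text{obs}} - SS_{\text{mean}} - SS_{\text{tr}}$. However, this is false economy because plots of the
residuals provide checks on the assumptions of the model.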

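The vector representations of the arrays involved in the decomposition (6-34)
also have geometric interpretations that provide the degrees of freedom. For an arbitrary
set of observations, let $[x_{11}, \dots, x_{1n_1}, x_{21}, \dots, x_{2n_2}, \dots, x_{gn_g}] = \mathbf{y}'$. The observation
vector $\mathbf{y}$ can lie anywhere in $n = n_1 + n_2 + \cdots + n_g$ dimensions; the
mean vector $\bar{x}\mathbf{1} = [\bar{x}, \dots, \bar{x}]'$ must lie along the equiangular line of $\mathbf{1}$, and the treatment
effect vector

$$(\bar{x}_1 - \bar{x})\begin{bmatrix}\mathbf{1}_{n_1}\\ \mathbf{0}_{n_2}\\ \vdots\\ \mathbf{0}_{n_g}\end{bmatrix} + (\bar{x}_2 - \bar{x})\begin{bmatrix}\mathbf{0}_{n_1}\\ \mathbf{1}_{n_2}\\ \vdots\\ \mathbf{0}_{n_g}\end{bmatrix} + \cdots + (\bar{x}_g - \bar{x})\begin{bmatrix}\mathbf{0}_{n_1}\\ \mathbf{0}_{n_2}\\ \vdots\\ \mathbf{1}_{n_g}\end{bmatrix}$$

$$= (\bar{x}_1 - \bar{x})\mathbf{u}_1 + (\bar{x}_2 - \bar{x})\mathbf{u}_2 + \cdots + (\bar{x}_g - \bar{x})\mathbf{u}_g$$

(here $\mathbf{1}_{m}$ and $\mathbf{0}_{m}$ denote columns of $m$ ones and $m$ zeros, so that $\mathbf{u}_\ell$ has ones in the $n_\ell$ positions belonging to the $\ell$th sample and zeros elsewhere)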




lies in the hyperplane of linear combinations of the $g$ vectors $\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_g$. Since
$\mathbf{1} = \mathbf{u}_1 + \mathbf{u}_2 + \cdots + \mathbf{u}_g$, the mean vector also lies in this hyperplane, and it is
always perpendicular to the treatment vector. (See Exercise 6.10.) Thus, the mean
vector has the freedom to lie anywhere along the one-dimensional equiangular line,
and the treatment vector has the freedom to lie anywhere in the other $g - 1$ dimensions.
The residual vector, $\hat{\mathbf{e}} = \mathbf{y} - (\bar{x}\mathbf{1}) - [(\bar{x}_1 - \bar{x})\mathbf{u}_1 + \cdots + (\bar{x}_g - \bar{x})\mathbf{u}_g]$, is
perpendicular to both the mean vector and the treatment effect vector and has the
freedom to lie anywhere in the subspace of dimension $n - (g - 1) - 1 = n - g$
that is perpendicular to their hyperplane.

To summarize, we attribute 1 d.f. to $\text{SS}_{\text{mean}}$, $g - 1$ d.f. to $\text{SS}_{\text{tr}}$, and $n - g =
(n_1 + n_2 + \cdots + n_g) - g$ d.f. to $\text{SS}_{\text{res}}$. The total number of degrees of freedom is
$n = n_1 + n_2 + \cdots + n_g$. Alternatively, by appealing to the univariate distribution
theory, we find that these are the degrees of freedom for the chi-square distributions
associated with the corresponding sums of squares.

The calculations of the sums of squares and the associated degrees of freedom 
are conveniently summarized by an ANOVA table. 


ANOVA Table for Comparing Univariate Population Means 


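Source of variation               Sum of squares (SS)                                                                                        Degrees of freedom (d.f.)

Treatments                        $\text{SS}_{\text{tr}} = \sum_{\ell=1}^{g} n_\ell(\bar{x}_\ell - \bar{x})^2$                              $g - 1$

Residual (error)                  $\text{SS}_{\text{res}} = \sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell} (x_{\ell j} - \bar{x}_\ell)^2$             $\sum_{\ell=1}^{g} n_\ell - g$

Total (corrected for the mean)    $\text{SS}_{\text{cor}} = \sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell} (x_{\ell j} - \bar{x})^2$                  $\sum_{\ell=1}^{g} n_\ell - 1$
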
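The usual F-test rejects $H_0\colon \tau_1 = \tau_2 = \cdots = \tau_g = 0$ at level $\alpha$ if

$$F = \frac{\text{SS}_{\text{tr}}/(g - 1)}{\text{SS}_{\text{res}}\Big/\Big(\sum_{\ell=1}^{g} n_\ell - g\Big)} > F_{g-1,\,\sum n_\ell - g}(\alpha)$$

where $F_{g-1,\,\sum n_\ell - g}(\alpha)$ is the upper $(100\alpha)$th percentile of the F-distribution with
$g - 1$ and $\sum n_\ell - g$ degrees of freedom. This is equivalent to rejecting $H_0$ for
large values of $\text{SS}_{\text{tr}}/\text{SS}_{\text{res}}$ or for large values of $1 + \text{SS}_{\text{tr}}/\text{SS}_{\text{res}}$. The statistic
appropriate for a multivariate generalization rejects $H_0$ for small values of the
reciprocal

$$\frac{1}{1 + \text{SS}_{\text{tr}}/\text{SS}_{\text{res}}} = \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{res}} + \text{SS}_{\text{tr}}} \tag{6-37}$$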




Example 6.8 (A univariate ANOVA table and F-test for treatment effects) Using the 
information in Example 6.7, we have the following ANOVA table: 


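Source of variation     Sum of squares                          Degrees of freedom

Treatments              $\text{SS}_{\text{tr}} = 78$            $g - 1 = 3 - 1 = 2$

Residual                $\text{SS}_{\text{res}} = 10$           $\sum_{\ell=1}^{g} n_\ell - g = (3 + 2 + 3) - 3 = 5$

Total (corrected)       $\text{SS}_{\text{cor}} = 88$           $\sum_{\ell=1}^{g} n_\ell - 1 = 7$

Consequently,

$$F = \frac{\text{SS}_{\text{tr}}/(g - 1)}{\text{SS}_{\text{res}}/(\sum n_\ell - g)} = \frac{78/2}{10/5} = 19.5$$

Since $F = 19.5 > F_{2,5}(.01) = 13.27$, we reject $H_0\colon \tau_1 = \tau_2 = \tau_3 = 0$ (no treatment
effect) at the 1% level of significance. ■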
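
As a quick arithmetic check of Example 6.8, the sketch below recomputes the F-ratio and the reciprocal in (6-37) from the sums of squares. The critical value 13.27 is simply the tabled figure quoted in the example, not something the code derives, and the variable names are illustrative.

```python
# F-statistic of Example 6.8 from the sums of squares of Example 6.7.
ss_tr, ss_res = 78.0, 10.0
n_sizes = [3, 2, 3]                 # sample sizes n_1, n_2, n_3
g = len(n_sizes)                    # number of groups

df_tr = g - 1
df_res = sum(n_sizes) - g

f_stat = (ss_tr / df_tr) / (ss_res / df_res)

# Equivalent form (6-37): H0 is rejected for SMALL values of this reciprocal.
reciprocal = ss_res / (ss_res + ss_tr)

f_crit_01 = 13.27                   # F_{2,5}(.01) as quoted in the example
reject_h0 = f_stat > f_crit_01

print(f_stat, reciprocal, reject_h0)
```

The reciprocal is a decreasing function of the F-ratio, which is why a large F and a small value of (6-37) lead to the same decision.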


Multivariate Analysis of Variance (MANOVA) 


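Paralleling the univariate reparameterization, we specify the MANOVA model:


MANOVA Model For Comparing g Population Mean Vectors


$$\mathbf{X}_{\ell j} = \boldsymbol{\mu} + \boldsymbol{\tau}_\ell + \mathbf{e}_{\ell j}, \quad j = 1, 2, \dots, n_\ell \ \text{ and } \ \ell = 1, 2, \dots, g \tag{6-38}$$

where the $\mathbf{e}_{\ell j}$ are independent $N_p(\mathbf{0}, \boldsymbol{\Sigma})$ variables. Here the parameter vector $\boldsymbol{\mu}$
is an overall mean (level), and $\boldsymbol{\tau}_\ell$ represents the $\ell$th treatment effect with
$\sum_{\ell=1}^{g} n_\ell \boldsymbol{\tau}_\ell = \mathbf{0}$.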


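According to the model in (6-38), each component of the observation vector $\mathbf{X}_{\ell j}$ satisfies
the univariate model (6-33). The errors for the components of $\mathbf{X}_{\ell j}$ are correlated,
but the covariance matrix $\boldsymbol{\Sigma}$ is the same for all populations.

A vector of observations may be decomposed as suggested by the model. Thus,

$$\underbrace{\mathbf{x}_{\ell j}}_{\text{(observation)}} = \underbrace{\bar{\mathbf{x}}}_{\text{(overall sample mean } \hat{\boldsymbol{\mu}})} + \underbrace{(\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})}_{\text{(estimated treatment effect } \hat{\boldsymbol{\tau}}_\ell)} + \underbrace{(\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)}_{\text{(residual } \hat{\mathbf{e}}_{\ell j})} \tag{6-39}$$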


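The decomposition in (6-39) leads to the multivariate analog of the univariate
sum of squares breakup in (6-35). First we note that the product

$$(\mathbf{x}_{\ell j} - \bar{\mathbf{x}})(\mathbf{x}_{\ell j} - \bar{\mathbf{x}})'$$

can be written as

$$\begin{aligned}
(\mathbf{x}_{\ell j} - \bar{\mathbf{x}})(\mathbf{x}_{\ell j} - \bar{\mathbf{x}})' &= [(\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell) + (\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})]\,[(\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell) + (\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})]' \\
&= (\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)(\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)' + (\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)(\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})' \\
&\quad + (\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})(\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)' + (\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})(\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})'
\end{aligned}$$

The sum over $j$ of the middle two expressions is the zero matrix, because
$\sum_{j=1}^{n_\ell} (\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell) = \mathbf{0}$. Hence, summing the cross product over $\ell$ and $j$ yields

$$\underbrace{\sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell} (\mathbf{x}_{\ell j} - \bar{\mathbf{x}})(\mathbf{x}_{\ell j} - \bar{\mathbf{x}})'}_{\text{total (corrected) sum of squares and cross products}} = \underbrace{\sum_{\ell=1}^{g} n_\ell(\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})(\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})'}_{\text{treatment (Between) sum of squares and cross products}} + \underbrace{\sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell} (\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)(\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)'}_{\text{residual (Within) sum of squares and cross products}} \tag{6-40}$$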


The within sum of squares and cross products matrix can be expressed as

$$\mathbf{W} = \sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell} (\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)(\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)' = (n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2 + \cdots + (n_g - 1)\mathbf{S}_g \tag{6-41}$$

where $\mathbf{S}_\ell$ is the sample covariance matrix for the $\ell$th sample. This matrix is a generalization
of the $(n_1 + n_2 - 2)\mathbf{S}_{\text{pooled}}$ matrix encountered in the two-sample case. It
plays a dominant role in testing for the presence of treatment effects.

Analogous to the univariate result, the hypothesis of no treatment effects,

$$H_0\colon \boldsymbol{\tau}_1 = \boldsymbol{\tau}_2 = \cdots = \boldsymbol{\tau}_g = \mathbf{0}$$

is tested by considering the relative sizes of the treatment and residual sums of
squares and cross products. Equivalently, we may consider the relative sizes of the
residual and total (corrected) sum of squares and cross products. Formally, we summarize
the calculations leading to the test statistic in a MANOVA table.


MANOVA Table for Comparing Population Mean Vectors 


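Source of variation               Matrix of sum of squares and cross products (SSP)                                                                                                   Degrees of freedom (d.f.)

Treatment                         $\mathbf{B} = \sum_{\ell=1}^{g} n_\ell(\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})(\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})'$                       $g - 1$

Residual (Error)                  $\mathbf{W} = \sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell} (\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)(\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)'$   $\sum_{\ell=1}^{g} n_\ell - g$

Total (corrected for the mean)    $\mathbf{B} + \mathbf{W} = \sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell} (\mathbf{x}_{\ell j} - \bar{\mathbf{x}})(\mathbf{x}_{\ell j} - \bar{\mathbf{x}})'$   $\sum_{\ell=1}^{g} n_\ell - 1$
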




This table is exactly the same form, component by component, as the ANOVA table,
except that squares of scalars are replaced by their vector counterparts. For example,
$(\bar{x}_\ell - \bar{x})^2$ becomes $(\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})(\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})'$. The degrees of freedom correspond to
the univariate geometry and also to some multivariate distribution theory involving
Wishart densities. (See [1].)

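One test of $H_0\colon \boldsymbol{\tau}_1 = \boldsymbol{\tau}_2 = \cdots = \boldsymbol{\tau}_g = \mathbf{0}$ involves generalized variances. We reject
$H_0$ if the ratio of generalized variances

$$\Lambda^* = \frac{|\mathbf{W}|}{|\mathbf{B} + \mathbf{W}|} = \frac{\left|\sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell} (\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)(\mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell)'\right|}{\left|\sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell} (\mathbf{x}_{\ell j} - \bar{\mathbf{x}})(\mathbf{x}_{\ell j} - \bar{\mathbf{x}})'\right|} \tag{6-42}$$

is too small. The quantity $\Lambda^* = |\mathbf{W}|/|\mathbf{B} + \mathbf{W}|$, proposed originally by Wilks
(see [25]), corresponds to the equivalent form (6-37) of the F-test of $H_0$: no treatment
effects in the univariate case. Wilks' lambda has the virtue of being convenient
and related to the likelihood ratio criterion.² The exact distribution of $\Lambda^*$ can be
derived for the special cases listed in Table 6.3. For other cases and large sample
sizes, a modification of $\Lambda^*$ due to Bartlett (see [4]) can be used to test $H_0$.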


Table 6.3 Distribution of Wilks’ Lambda, A* = |W|/|B + W| 


No. of variables     No. of groups     Sampling distribution for multivariate normal data


² Wilks' lambda can also be expressed as a function of the eigenvalues $\hat{\lambda}_1, \hat{\lambda}_2, \dots, \hat{\lambda}_s$ of $\mathbf{W}^{-1}\mathbf{B}$ as

$$\Lambda^* = \prod_{i=1}^{s} \left(\frac{1}{1 + \hat{\lambda}_i}\right)$$

where $s = \min(p, g - 1)$, the rank of $\mathbf{B}$. Other statistics for checking the equality of several multivariate
means, such as Pillai's statistic, the Lawley-Hotelling statistic, and Roy's largest root statistic, can also
be written as particular functions of the eigenvalues of $\mathbf{W}^{-1}\mathbf{B}$. For large samples, all of these statistics are
essentially equivalent. (See the additional discussion on page 336.)




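Bartlett (see [4]) has shown that if $H_0$ is true and $\sum n_\ell = n$ is large,

$$-\left(n - 1 - \frac{p + g}{2}\right)\ln \Lambda^* = -\left(n - 1 - \frac{p + g}{2}\right)\ln\left(\frac{|\mathbf{W}|}{|\mathbf{B} + \mathbf{W}|}\right) \tag{6-43}$$

has approximately a chi-square distribution with $p(g - 1)$ d.f. Consequently, for
$\sum n_\ell = n$ large, we reject $H_0$ at significance level $\alpha$ if

$$-\left(n - 1 - \frac{p + g}{2}\right)\ln\left(\frac{|\mathbf{W}|}{|\mathbf{B} + \mathbf{W}|}\right) > \chi^2_{p(g-1)}(\alpha) \tag{6-44}$$

where $\chi^2_{p(g-1)}(\alpha)$ is the upper $(100\alpha)$th percentile of a chi-square distribution with
$p(g - 1)$ d.f.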


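Example 6.9 (A MANOVA table and Wilks' lambda for testing the equality of three
mean vectors) Suppose an additional variable is observed along with the variable
introduced in Example 6.7. The sample sizes are $n_1 = 3$, $n_2 = 2$, and $n_3 = 3$.
Arranging the observation pairs $\mathbf{x}_{\ell j}$ in rows, we obtain

$$\begin{pmatrix}
\begin{bmatrix}9\\3\end{bmatrix} & \begin{bmatrix}6\\2\end{bmatrix} & \begin{bmatrix}9\\7\end{bmatrix} \\[2ex]
\begin{bmatrix}0\\4\end{bmatrix} & \begin{bmatrix}2\\0\end{bmatrix} & \\[2ex]
\begin{bmatrix}3\\8\end{bmatrix} & \begin{bmatrix}1\\9\end{bmatrix} & \begin{bmatrix}2\\7\end{bmatrix}
\end{pmatrix}
\quad\text{with}\quad
\bar{\mathbf{x}}_1 = \begin{bmatrix}8\\4\end{bmatrix},\ \
\bar{\mathbf{x}}_2 = \begin{bmatrix}1\\2\end{bmatrix},\ \
\bar{\mathbf{x}}_3 = \begin{bmatrix}2\\8\end{bmatrix},\ \
\text{and}\ \ \bar{\mathbf{x}} = \begin{bmatrix}4\\5\end{bmatrix}$$
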

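We have already expressed the observations on the first variable as the sum of an
overall mean, treatment effect, and residual in our discussion of univariate
ANOVA. We found that

$$\underbrace{\begin{pmatrix}9 & 6 & 9\\ 0 & 2 & \\ 3 & 1 & 2\end{pmatrix}}_{\text{(observation)}} = \underbrace{\begin{pmatrix}4 & 4 & 4\\ 4 & 4 & \\ 4 & 4 & 4\end{pmatrix}}_{\text{(mean)}} + \underbrace{\begin{pmatrix}4 & 4 & 4\\ -3 & -3 & \\ -2 & -2 & -2\end{pmatrix}}_{\text{(treatment effect)}} + \underbrace{\begin{pmatrix}1 & -2 & 1\\ -1 & 1 & \\ 1 & -1 & 0\end{pmatrix}}_{\text{(residual)}}$$

and

$$\text{SS}_{\text{obs}} = \text{SS}_{\text{mean}} + \text{SS}_{\text{tr}} + \text{SS}_{\text{res}}$$

$$216 = 128 + 78 + 10$$

Total SS (corrected) $= \text{SS}_{\text{obs}} - \text{SS}_{\text{mean}} = 216 - 128 = 88$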


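Repeating this operation for the observations on the second variable, we have

$$\underbrace{\begin{pmatrix}3 & 2 & 7\\ 4 & 0 & \\ 8 & 9 & 7\end{pmatrix}}_{\text{(observation)}} = \underbrace{\begin{pmatrix}5 & 5 & 5\\ 5 & 5 & \\ 5 & 5 & 5\end{pmatrix}}_{\text{(mean)}} + \underbrace{\begin{pmatrix}-1 & -1 & -1\\ -3 & -3 & \\ 3 & 3 & 3\end{pmatrix}}_{\text{(treatment effect)}} + \underbrace{\begin{pmatrix}-1 & -2 & 3\\ 2 & -2 & \\ 0 & 1 & -1\end{pmatrix}}_{\text{(residual)}}$$

and

$$\text{SS}_{\text{obs}} = \text{SS}_{\text{mean}} + \text{SS}_{\text{tr}} + \text{SS}_{\text{res}}$$

$$272 = 200 + 48 + 24$$

Total SS (corrected) $= \text{SS}_{\text{obs}} - \text{SS}_{\text{mean}} = 272 - 200 = 72$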


These two single-component analyses must be augmented with the sum of entry- 
by-entry cross products in order to complete the entries in the MANOVA table. 
Proceeding row by row in the arrays for the two variables, we obtain the cross 
product contributions: 


Mean: 4(5) + 4(5) +--- + 4(5) = 8(4)(5) = 160 
Treatment: 3(4)(—1) + 2(—3)(—3) + 3(-2) (3) = -12 
Residual: 1(—1) + (—2)(—2) + 1(3) + (-1)(2) +--- + 0(-1) =1 
Total: 9(3) + 6(2) + 9(7) + 0(4) +--- + 2(7) = 149 
Total (corrected) cross product = total cross product — mean cross product 
= 149 - 160 = -11 
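Thus, the MANOVA table takes the following form:


Source of variation     Matrix of sum of squares and cross products                 Degrees of freedom

Treatment               $\begin{bmatrix}78 & -12\\ -12 & 48\end{bmatrix}$           $3 - 1 = 2$

Residual                $\begin{bmatrix}10 & 1\\ 1 & 24\end{bmatrix}$               $3 + 2 + 3 - 3 = 5$

Total (corrected)       $\begin{bmatrix}88 & -11\\ -11 & 72\end{bmatrix}$           $7$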


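Equation (6-40) is verified by noting that

$$\begin{bmatrix}88 & -11\\ -11 & 72\end{bmatrix} = \begin{bmatrix}78 & -12\\ -12 & 48\end{bmatrix} + \begin{bmatrix}10 & 1\\ 1 & 24\end{bmatrix}$$

Using (6-42), we get

$$\Lambda^* = \frac{|\mathbf{W}|}{|\mathbf{B} + \mathbf{W}|} = \frac{\begin{vmatrix}10 & 1\\ 1 & 24\end{vmatrix}}{\begin{vmatrix}88 & -11\\ -11 & 72\end{vmatrix}} = \frac{10(24) - (1)^2}{88(72) - (-11)^2} = \frac{239}{6215} = .0385$$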




Since $p = 2$ and $g = 3$, Table 6.3 indicates that an exact test (assuming normality
and equal group covariance matrices) of $H_0\colon \boldsymbol{\tau}_1 = \boldsymbol{\tau}_2 = \boldsymbol{\tau}_3 = \mathbf{0}$ (no treatment
effects) versus $H_1$: at least one $\boldsymbol{\tau}_\ell \neq \mathbf{0}$ is available. To carry out the test, we compare
the test statistic

$$\left(\frac{1 - \sqrt{\Lambda^*}}{\sqrt{\Lambda^*}}\right)\frac{(\sum n_\ell - g - 1)}{(g - 1)} = \left(\frac{1 - \sqrt{.0385}}{\sqrt{.0385}}\right)\left(\frac{8 - 3 - 1}{3 - 1}\right) = 8.19$$

with a percentage point of an F-distribution having $\nu_1 = 2(g - 1) = 4$ and
$\nu_2 = 2(\sum n_\ell - g - 1) = 8$ d.f. Since $8.19 > F_{4,8}(.01) = 7.01$, we reject $H_0$ at the
$\alpha = .01$ level and conclude that treatment differences exist. ■
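
The matrices in Example 6.9 can be reproduced directly from the eight observation pairs. The sketch below is one way to do so with plain Python lists; the helper names (`mean`, `outer`, `add`, `det2`) are illustrative assumptions, not notation from the text. Because it carries $\Lambda^*$ at full precision rather than rounding it to .0385, the final statistic agrees with the reported 8.19 only up to rounding.

```python
import math

# Observation pairs (x1, x2) of Example 6.9, one list per sample.
samples = [
    [(9, 3), (6, 2), (9, 7)],
    [(0, 4), (2, 0)],
    [(3, 8), (1, 9), (2, 7)],
]
p, g = 2, len(samples)
n = sum(len(s) for s in samples)


def mean(vectors):
    """Componentwise mean of a list of p-vectors."""
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(p)]


def outer(d):
    """Outer product d d' of a p-vector with itself."""
    return [[d[i] * d[j] for j in range(p)] for i in range(p)]


def add(a, b, scale=1.0):
    """Matrix sum a + scale * b."""
    return [[a[i][j] + scale * b[i][j] for j in range(p)] for i in range(p)]


def det2(a):
    """Determinant of a 2 x 2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


grand = mean([x for s in samples for x in s])
B = [[0.0] * p for _ in range(p)]    # treatment (between) SSP matrix
W = [[0.0] * p for _ in range(p)]    # residual (within) SSP matrix
for s in samples:
    m = mean(s)
    B = add(B, outer([m[i] - grand[i] for i in range(p)]), scale=len(s))
    for x in s:
        W = add(W, outer([x[i] - m[i] for i in range(p)]))

wilks = det2(W) / det2(add(B, W))    # Wilks' lambda (6-42)

# Exact test for p = 2 (Table 6.3): F with 2(g - 1) and 2(n - g - 1) d.f.
root = math.sqrt(wilks)
f_stat = (1 - root) / root * (n - g - 1) / (g - 1)

print(B, W, round(wilks, 4), round(f_stat, 2))
```

For $p > 2$ the same steps apply, but a general determinant routine would be needed in place of `det2`.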


When the number of variables, p, is large, the MANOVA table is usually not 
constructed. Still, it is good practice to have the computer print the matrices B and 
W so that especially large entries can be located. Also, the residual vectors 


$$\hat{\mathbf{e}}_{\ell j} = \mathbf{x}_{\ell j} - \bar{\mathbf{x}}_\ell$$


should be examined for normality and the presence of outliers using the techniques 
discussed in Sections 4.6 and 4.7 of Chapter 4. 


Example 6.10 (A multivariate analysis of Wisconsin nursing home data) The 
Wisconsin Department of Health and Social Services reimburses nursing homes in 
the state for the services provided. The department develops a set of formulas for 
rates for each facility, based on factors such as level of care, mean wage rate, and 
average wage rate in the state. 

Nursing homes can be classified on the basis of ownership (private party, 
nonprofit organization, and government) and certification (skilled nursing facility, 
intermediate care facility, or a combination of the two). 

One purpose of a recent study was to investigate the effects of ownership or 
certification (or both) on costs. Four costs, computed on a per-patient-day basis and 
measured in hours per patient day, were selected for analysis: $X_1$ = cost of nursing
labor, $X_2$ = cost of dietary labor, $X_3$ = cost of plant operation and maintenance labor,
and $X_4$ = cost of housekeeping and laundry labor. A total of $n = 516$ observations
on each of the p = 4 cost variables were initially separated according to ownership. 
Summary statistics for each of the g = 3 groups are given in the following table. 


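Group                      Number of observations                 Sample mean vectors

$\ell = 1$ (private)       $n_1 = 271$                            $\bar{\mathbf{x}}_1 = [2.066,\ .480,\ .082,\ .360]'$

$\ell = 2$ (nonprofit)     $n_2 = 138$                            $\bar{\mathbf{x}}_2 = [2.167,\ .596,\ .124,\ .418]'$

$\ell = 3$ (government)    $n_3 = 107$                            $\bar{\mathbf{x}}_3 = [2.273,\ .521,\ .125,\ .383]'$

                           $\sum_{\ell=1}^{3} n_\ell = 516$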




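Sample covariance matrices

$$\mathbf{S}_1 = \begin{bmatrix}.291 & & & \\ -.001 & .011 & & \\ .002 & .000 & .001 & \\ .010 & .003 & .000 & .010\end{bmatrix}; \qquad
\mathbf{S}_2 = \begin{bmatrix}.561 & & & \\ .011 & .025 & & \\ .001 & .004 & .005 & \\ .037 & .007 & .002 & .019\end{bmatrix};$$

$$\mathbf{S}_3 = \begin{bmatrix}.261 & & & \\ .030 & .017 & & \\ .003 & -.000 & .004 & \\ .018 & .006 & .001 & .013\end{bmatrix}$$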


Source: Data courtesy of State of Wisconsin Department of Health and Social Services. 


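Since the $\mathbf{S}_\ell$'s seem to be reasonably compatible,³ they were pooled [see (6-41)]
to obtain

$$\mathbf{W} = (n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2 + (n_3 - 1)\mathbf{S}_3 = \begin{bmatrix}182.962 & & & \\ 4.408 & 8.200 & & \\ 1.695 & .633 & 1.484 & \\ 9.581 & 2.428 & .394 & 6.538\end{bmatrix}$$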
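Also,

$$\bar{\mathbf{x}} = \frac{n_1\bar{\mathbf{x}}_1 + n_2\bar{\mathbf{x}}_2 + n_3\bar{\mathbf{x}}_3}{n_1 + n_2 + n_3} = \begin{bmatrix}2.136\\ .519\\ .102\\ .380\end{bmatrix}$$

and

$$\mathbf{B} = \sum_{\ell=1}^{3} n_\ell(\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})(\bar{\mathbf{x}}_\ell - \bar{\mathbf{x}})' = \begin{bmatrix}3.475 & & & \\ 1.111 & 1.225 & & \\ .821 & .453 & .235 & \\ .584 & .610 & .230 & .304\end{bmatrix}$$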


To test $H_0\colon \boldsymbol{\tau}_1 = \boldsymbol{\tau}_2 = \boldsymbol{\tau}_3$ (no ownership effects or, equivalently, no difference in average
costs among the three types of owners: private, nonprofit, and government),
we can use the result in Table 6.3 for $g = 3$.

Computer-based calculations give

$$\Lambda^* = \frac{|\mathbf{W}|}{|\mathbf{B} + \mathbf{W}|} = .7714$$

and

$$\left(\frac{\sum n_\ell - p - 2}{p}\right)\left(\frac{1 - \sqrt{\Lambda^*}}{\sqrt{\Lambda^*}}\right) = \left(\frac{516 - 4 - 2}{4}\right)\left(\frac{1 - \sqrt{.7714}}{\sqrt{.7714}}\right) = 17.67$$

Let $\alpha = .01$, so that $F_{2(4),2(510)}(.01) \doteq \chi^2_8(.01)/8 = 2.51$. Since $17.67 > F_{8,1020}(.01) \doteq
2.51$, we reject $H_0$ at the 1% level and conclude that average costs differ, depending on
type of ownership.

³ However, a normal-theory test of $H_0\colon \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}_3$ would reject $H_0$ at any reasonable significance
level because of the large sample sizes (see Example 6.12).

It is informative to compare the results based on this "exact" test with those
obtained using the large-sample procedure summarized in (6-43) and (6-44). For the
present example, $\sum n_\ell = n = 516$ is large, and $H_0$ can be tested at the $\alpha = .01$ level
by comparing

$$-\left(n - 1 - \frac{p + g}{2}\right)\ln\left(\frac{|\mathbf{W}|}{|\mathbf{B} + \mathbf{W}|}\right) = -511.5\,\ln(.7714) = 132.76$$

with $\chi^2_{p(g-1)}(.01) = \chi^2_8(.01) = 20.09$. Since $132.76 > \chi^2_8(.01) = 20.09$, we reject $H_0$
at the 1% level. This result is consistent with the result based on the foregoing
F-statistic. ■
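
Both test statistics in Example 6.10 are simple functions of $\Lambda^*$, $n$, $p$, and $g$. The following sketch recomputes them from the reported value $\Lambda^* = .7714$; the variable names are illustrative, and the percentiles (2.51 and 20.09) must still be taken from tables or statistical software.

```python
import math

wilks = 0.7714            # reported value of |W| / |B + W|
n, p, g = 516, 4, 3

# "Exact" test for g = 3 (Table 6.3): refer to F with 2p and 2(n - p - 2) d.f.
root = math.sqrt(wilks)
f_stat = (n - p - 2) / p * (1 - root) / root
df1, df2 = 2 * p, 2 * (n - p - 2)

# Large-sample test (6-43)-(6-44): refer to chi-square with p(g - 1) d.f.
bartlett = -(n - 1 - (p + g) / 2) * math.log(wilks)
df_chi2 = p * (g - 1)

print(round(f_stat, 2), (df1, df2), round(bartlett, 2), df_chi2)
```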


6.5 Simultaneous Confidence Intervals for Treatment Effects 


When the hypothesis of equal treatment effects is rejected, those effects that led to
the rejection of the hypothesis are of interest. For pairwise comparisons, the Bonferroni
approach (see Section 5.4) can be used to construct simultaneous confidence
intervals for the components of the differences $\boldsymbol{\tau}_k - \boldsymbol{\tau}_\ell$ (or $\boldsymbol{\mu}_k - \boldsymbol{\mu}_\ell$). These intervals
are shorter than those obtained for all contrasts, and they require critical values
only for the univariate t-statistic.

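Let $\tau_{ki}$ be the $i$th component of $\boldsymbol{\tau}_k$. Since $\boldsymbol{\tau}_k$ is estimated by $\hat{\boldsymbol{\tau}}_k = \bar{\mathbf{x}}_k - \bar{\mathbf{x}}$,

$$\hat{\tau}_{ki} = \bar{x}_{ki} - \bar{x}_i \tag{6-45}$$

and $\hat{\tau}_{ki} - \hat{\tau}_{\ell i} = \bar{x}_{ki} - \bar{x}_{\ell i}$ is the difference between two independent sample means.
The two-sample t-based confidence interval is valid with an appropriately
modified $\alpha$. Notice that

$$\text{Var}(\hat{\tau}_{ki} - \hat{\tau}_{\ell i}) = \text{Var}(\bar{X}_{ki} - \bar{X}_{\ell i}) = \left(\frac{1}{n_k} + \frac{1}{n_\ell}\right)\sigma_{ii}$$

where $\sigma_{ii}$ is the $i$th diagonal element of $\boldsymbol{\Sigma}$. As suggested by (6-41), $\text{Var}(\bar{X}_{ki} - \bar{X}_{\ell i})$
is estimated by dividing the corresponding element of $\mathbf{W}$ by its degrees of freedom.
That is,

$$\widehat{\text{Var}}(\bar{X}_{ki} - \bar{X}_{\ell i}) = \left(\frac{1}{n_k} + \frac{1}{n_\ell}\right)\frac{w_{ii}}{n - g}$$

where $w_{ii}$ is the $i$th diagonal element of $\mathbf{W}$ and $n = n_1 + \cdots + n_g$.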




It remains to apportion the error rate over the numerous confidence statements.
Relation (5-28) still applies. There are $p$ variables and $g(g - 1)/2$ pairwise
differences, so each two-sample t-interval will employ the critical value $t_{n-g}(\alpha/2m)$,
where

$$m = pg(g - 1)/2 \tag{6-46}$$

is the number of simultaneous confidence statements.


Result 6.5. Let $n = \sum_{k=1}^{g} n_k$. For the model in (6-38), with confidence at least
$(1 - \alpha)$,

$$\tau_{ki} - \tau_{\ell i} \quad \text{belongs to} \quad \bar{x}_{ki} - \bar{x}_{\ell i} \pm t_{n-g}\!\left(\frac{\alpha}{pg(g - 1)}\right)\sqrt{\frac{w_{ii}}{n - g}\left(\frac{1}{n_k} + \frac{1}{n_\ell}\right)}$$

for all components $i = 1, \dots, p$ and all differences $\ell < k = 1, \dots, g$. Here $w_{ii}$ is the
$i$th diagonal element of $\mathbf{W}$.


We shall illustrate the construction of simultaneous interval estimates for the 
pairwise differences in treatment means using the nursing-home data introduced in 
Example 6.10. 


Example 6.11 (Simultaneous intervals for treatment differences—nursing homes) 
We saw in Example 6.10 that average costs for nursing homes differ, depending on 
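the type of ownership. We can use Result 6.5 to estimate the magnitudes of the differences.
A comparison of the variable $X_3$, costs of plant operation and maintenance
labor, between privately owned nursing homes and government-owned nursing
homes can be made by estimating $\tau_{13} - \tau_{33}$. Using (6-39) and the information in
Example 6.10, we have

$$\hat{\boldsymbol{\tau}}_1 = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}) = \begin{bmatrix}-.070\\ -.039\\ -.020\\ -.020\end{bmatrix}, \qquad
\hat{\boldsymbol{\tau}}_3 = (\bar{\mathbf{x}}_3 - \bar{\mathbf{x}}) = \begin{bmatrix}.137\\ .002\\ .023\\ .003\end{bmatrix}$$

$$\mathbf{W} = \begin{bmatrix}182.962 & & & \\ 4.408 & 8.200 & & \\ 1.695 & .633 & 1.484 & \\ 9.581 & 2.428 & .394 & 6.538\end{bmatrix}$$

Consequently,

$$\hat{\tau}_{13} - \hat{\tau}_{33} = -.020 - .023 = -.043$$

and $n = 271 + 138 + 107 = 516$, so that

$$\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_3}\right)\frac{w_{33}}{n - g}} = \sqrt{\left(\frac{1}{271} + \frac{1}{107}\right)\frac{1.484}{516 - 3}} = .00614$$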




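Since $p = 4$ and $g = 3$, for 95% simultaneous confidence statements we require
that $t_{513}(.05/4(3)2) = 2.87$. (See Appendix, Table 1.) The 95% simultaneous confidence
statement is

$$\begin{aligned}
\tau_{13} - \tau_{33} \ \text{ belongs to } \ &\hat{\tau}_{13} - \hat{\tau}_{33} \pm t_{513}(.00208)\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_3}\right)\frac{w_{33}}{n - g}} \\
&= -.043 \pm 2.87(.00614) \\
&= -.043 \pm .018, \ \text{ or } \ (-.061,\ -.025)
\end{aligned}$$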


We conclude that the average maintenance and labor cost for government-owned
nursing homes is higher by .025 to .061 hour per patient day than for privately
owned nursing homes. With the same 95% confidence, we can say that

$\tau_{13} - \tau_{23}$ belongs to the interval $(-.058,\ -.026)$

and

$\tau_{23} - \tau_{33}$ belongs to the interval $(-.021,\ .019)$

Thus, a difference in this cost exists between private and nonprofit nursing homes,
but no difference is observed between nonprofit and government nursing homes. ■
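
The interval arithmetic of Example 6.11 can be sketched as follows. This is a minimal illustration of Result 6.5 for the third variable only; the function name `interval` is hypothetical, and the critical value 2.87 is the tabled $t_{513}(.00208)$ quoted in the example rather than a computed quantile.

```python
import math

n_sizes = {1: 271, 2: 138, 3: 107}     # private, nonprofit, government
n = sum(n_sizes.values())
g, p = 3, 4
w33 = 1.484                            # third diagonal element of W
alpha = 0.05

m = p * g * (g - 1) // 2               # number of simultaneous statements (6-46)
tail_area = alpha / (2 * m)            # upper-tail area used for each t-interval
t_crit = 2.87                          # t_513(.00208), as quoted in the example


def interval(diff, k, l):
    """Result 6.5 interval for tau_k3 - tau_l3, given the estimated difference."""
    se = math.sqrt((1 / n_sizes[k] + 1 / n_sizes[l]) * w33 / (n - g))
    return diff - t_crit * se, diff + t_crit * se, se


low, high, se = interval(-0.043, 1, 3)
print(m, round(tail_area, 5), round(se, 5), round(low, 3), round(high, 3))
```

Only the sample sizes change from one pair of groups to the next, so the same function gives the other two intervals once the corresponding estimated differences are supplied.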


6.6 Testing for Equality of Covariance Matrices 


One of the assumptions made when comparing two or more multivariate mean vec- 
tors is that the covariance matrices of the potentially different populations are the 
same. (This assumption will appear again in Chapter 11 when we discuss discrimina- 
tion and classification.) Before pooling the variation across samples to form a 
pooled covariance matrix when comparing mean vectors, it can be worthwhile to 
test the equality of the population covariance matrices. One commonly employed
test for equal covariance matrices is Box's M-test ([8], [9]).
With $g$ populations, the null hypothesis is

$$H_0\colon \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_g = \boldsymbol{\Sigma} \tag{6-47}$$

where $\boldsymbol{\Sigma}_\ell$ is the covariance matrix for the $\ell$th population, $\ell = 1, 2, \dots, g$, and $\boldsymbol{\Sigma}$ is
the presumed common covariance matrix. The alternative hypothesis is that at least
two of the covariance matrices are not equal.

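Assuming multivariate normal populations, a likelihood ratio statistic for testing
(6-47) is given by (see [1])

$$\Lambda = \prod_{\ell}\left(\frac{|\mathbf{S}_\ell|}{|\mathbf{S}_{\text{pooled}}|}\right)^{(n_\ell - 1)/2} \tag{6-48}$$

Here $n_\ell$ is the sample size for the $\ell$th group, $\mathbf{S}_\ell$ is the $\ell$th group sample covariance
matrix and $\mathbf{S}_{\text{pooled}}$ is the pooled sample covariance matrix given by

$$\mathbf{S}_{\text{pooled}} = \frac{1}{\sum_{\ell}(n_\ell - 1)}\left\{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2 + \cdots + (n_g - 1)\mathbf{S}_g\right\} \tag{6-49}$$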




Box's test is based on his $\chi^2$ approximation to the sampling distribution of $-2\ln\Lambda$
(see Result 5.2). Setting $-2\ln\Lambda = M$ (Box's M statistic) gives

$$M = \left[\sum_{\ell}(n_\ell - 1)\right]\ln|\mathbf{S}_{\text{pooled}}| - \sum_{\ell}\left[(n_\ell - 1)\ln|\mathbf{S}_\ell|\right] \tag{6-50}$$

If the null hypothesis is true, the individual sample covariance matrices are not
expected to differ too much and, consequently, do not differ too much from the
pooled covariance matrix. In this case, the ratios of the determinants in (6-48) will all
be close to 1, $\Lambda$ will be near 1, and Box's M statistic will be small. If the null hypothesis
is false, the sample covariance matrices can differ more and the differences in
their determinants will be more pronounced. In this case $\Lambda$ will be small and $M$ will
be relatively large. To illustrate, note that the determinant of the pooled covariance
matrix, $|\mathbf{S}_{\text{pooled}}|$, will lie somewhere near the "middle" of the determinants $|\mathbf{S}_\ell|$'s of
the individual group covariance matrices. As the latter quantities become more
disparate, the product of the ratios in (6-48) will get closer to 0. In fact, as the $|\mathbf{S}_\ell|$'s
increase in spread, $|\mathbf{S}_{(1)}|/|\mathbf{S}_{\text{pooled}}|$ reduces the product proportionally more than
$|\mathbf{S}_{(g)}|/|\mathbf{S}_{\text{pooled}}|$ increases it, where $|\mathbf{S}_{(1)}|$ and $|\mathbf{S}_{(g)}|$ are the minimum and maximum
determinant values, respectively.


Box’s Test for Equality of Covariance Matrices 


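Set

$$u = \left[\sum_{\ell}\frac{1}{(n_\ell - 1)} - \frac{1}{\sum_{\ell}(n_\ell - 1)}\right]\left[\frac{2p^2 + 3p - 1}{6(p + 1)(g - 1)}\right] \tag{6-51}$$

where $p$ is the number of variables and $g$ is the number of groups. Then

$$C = (1 - u)M = (1 - u)\left\{\left[\sum_{\ell}(n_\ell - 1)\right]\ln|\mathbf{S}_{\text{pooled}}| - \sum_{\ell}\left[(n_\ell - 1)\ln|\mathbf{S}_\ell|\right]\right\} \tag{6-52}$$

has an approximate $\chi^2$ distribution with

$$\nu = g\,\tfrac{1}{2}p(p + 1) - \tfrac{1}{2}p(p + 1) = \tfrac{1}{2}p(p + 1)(g - 1) \tag{6-53}$$

degrees of freedom. At significance level $\alpha$, reject $H_0$ if $C > \chi^2_{p(p+1)(g-1)/2}(\alpha)$.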


Box's $\chi^2$ approximation works well if each $n_\ell$ exceeds 20 and if $p$ and $g$ do not
exceed 5. In situations where these conditions do not hold, Box ([7], [8]) has provided
a more precise F approximation to the sampling distribution of $M$.


Example 6.12 (Testing equality of covariance matrices—nursing homes) We intro- 
duced the Wisconsin nursing home data in Example 6.10. In that example the 
sample covariance matrices for p = 4 cost variables associated with g = 3 groups 
of nursing homes are displayed. Assuming multivariate normal data, we test the 
hypothesis $H_0\colon \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}_3 = \boldsymbol{\Sigma}$.




Using the information in Example 6.10, we have $n_1 = 271$, $n_2 = 138$,
$n_3 = 107$ and $|\mathbf{S}_1| = 2.783 \times 10^{-8}$, $|\mathbf{S}_2| = 89.539 \times 10^{-8}$, $|\mathbf{S}_3| = 14.579 \times 10^{-8}$, and
$|\mathbf{S}_{\text{pooled}}| = 17.398 \times 10^{-8}$. Taking the natural logarithms of the determinants gives
$\ln|\mathbf{S}_1| = -17.397$, $\ln|\mathbf{S}_2| = -13.926$, $\ln|\mathbf{S}_3| = -15.741$ and $\ln|\mathbf{S}_{\text{pooled}}| = -15.564$.
We calculate

$$u = \left[\frac{1}{270} + \frac{1}{137} + \frac{1}{106} - \frac{1}{270 + 137 + 106}\right]\left[\frac{2(4^2) + 3(4) - 1}{6(4 + 1)(3 - 1)}\right] = .0133$$

$$M = [270 + 137 + 106](-15.564) - [270(-17.397) + 137(-13.926) + 106(-15.741)] = 289.3$$

and $C = (1 - .0133)289.3 = 285.5$. Referring $C$ to a $\chi^2$ table with $\nu = 4(4 + 1)(3 - 1)/2
= 20$ degrees of freedom, it is clear that $H_0$ is rejected at any reasonable level of significance.
We conclude that the covariance matrices of the cost variables associated
with the three populations of nursing homes are not the same. ■
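
The quantities in (6-50) through (6-53) can be recomputed from the summary figures of Example 6.12. The sketch below takes the rounded log-determinants reported in the example as inputs (an assumption made here so that the arithmetic mirrors the text); because intermediate values are not rounded, the results match the reported $u = .0133$, $M = 289.3$, and $C = 285.5$ only up to rounding in the last digit.

```python
n_sizes = [271, 138, 107]
p, g = 4, len(n_sizes)
log_det_groups = [-17.397, -13.926, -15.741]   # ln|S_l| as reported in the example
log_det_pooled = -15.564                        # ln|S_pooled| as reported

dfs = [n_l - 1 for n_l in n_sizes]              # n_l - 1 for each group
total_df = sum(dfs)

# Box's M statistic (6-50)
M = total_df * log_det_pooled - sum(d * ld for d, ld in zip(dfs, log_det_groups))

# Scale factor (6-51) and corrected statistic (6-52)
u = (sum(1 / d for d in dfs) - 1 / total_df) * (2 * p**2 + 3 * p - 1) / (
    6 * (p + 1) * (g - 1)
)
C = (1 - u) * M

# Degrees of freedom (6-53)
nu = p * (p + 1) * (g - 1) // 2

print(round(M, 1), round(u, 4), round(C, 1), nu)
```

In practice the log-determinants would be computed from the sample covariance matrices themselves, which requires a determinant routine that is omitted here.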


Box’s M-test is routinely calculated in many statistical computer packages that 
do MANOVA and other procedures requiring equal covariance matrices. It is 
known that the M-test is sensitive to some forms of non-normality. More broadly, in
the presence of non-normality, normal theory tests on covariances are influenced by
the kurtosis of the parent populations (see [16]). However, with reasonably large
samples, the MANOVA tests of means or treatment effects are rather robust to
non-normality. Thus the M-test may reject $H_0$ in some non-normal cases where it is
not damaging to the MANOVA tests. Moreover, with equal sample sizes, some
differences in covariance matrices have little effect on the MANOVA tests. To
summarize, we may decide to continue with the usual MANOVA tests even though
the M-test leads to rejection of $H_0$.


6.7 Two-Way Multivariate Analysis of Variance 


Following our approach to the one-way MANOVA, we shall briefly review the 
analysis for a univariate two-way fixed-effects model and then simply generalize to 
the multivariate case by analogy. 


Univariate Two-Way Fixed-Effects Model with Interaction 


We assume that measurements are recorded at various levels of two factors. In some
cases, these experimental conditions represent levels of a single treatment arranged
within several blocks. The particular experimental design employed will not concern
us in this book. (See [10] and [17] for discussions of experimental design.) We shall,
however, assume that observations at different combinations of experimental conditions
are independent of one another.

Let the two sets of experimental conditions be the levels of, for instance, factor
1 and factor 2, respectively.⁴ Suppose there are $g$ levels of factor 1 and $b$ levels of factor
2, and that $n$ independent observations can be observed at each of the $gb$ combinations
of levels. Denoting the $r$th observation at level $\ell$ of factor 1 and level $k$ of
factor 2 by $X_{\ell k r}$, we specify the univariate two-way model as

$$X_{\ell k r} = \mu + \tau_\ell + \beta_k + \gamma_{\ell k} + e_{\ell k r}, \qquad \ell = 1, 2, \dots, g; \quad k = 1, 2, \dots, b; \quad r = 1, 2, \dots, n \tag{6-54}$$

⁴ The use of the term "factor" to indicate an experimental condition is convenient. The factors discussed
here should not be confused with the unobservable factors considered in Chapter 9 in the context
of factor analysis.

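where $\sum_{\ell=1}^{g} \tau_\ell = \sum_{k=1}^{b} \beta_k = \sum_{\ell=1}^{g} \gamma_{\ell k} = \sum_{k=1}^{b} \gamma_{\ell k} = 0$ and the $e_{\ell k r}$ are independent
$N(0, \sigma^2)$ random variables. Here $\mu$ represents an overall level, $\tau_\ell$ represents the
fixed effect of factor 1, $\beta_k$ represents the fixed effect of factor 2, and $\gamma_{\ell k}$ is the interaction
between factor 1 and factor 2. The expected response at the $\ell$th level of factor
1 and the $k$th level of factor 2 is thus

$$\underbrace{E(X_{\ell k r})}_{\text{(mean response)}} = \underbrace{\mu}_{\text{(overall level)}} + \underbrace{\tau_\ell}_{\text{(effect of factor 1)}} + \underbrace{\beta_k}_{\text{(effect of factor 2)}} + \underbrace{\gamma_{\ell k}}_{\text{(factor 1-factor 2 interaction)}}$$

$$\ell = 1, 2, \dots, g, \qquad k = 1, 2, \dots, b \tag{6-55}$$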


The presence of interaction, $\gamma_{\ell k}$, implies that the factor effects are not additive
and complicates the interpretation of the results. Figures 6.3(a) and (b) show
expected responses as a function of the factor levels with and without interaction,
respectively. The absence of interaction means $\gamma_{\ell k} = 0$ for all $\ell$ and $k$.

[Figure 6.3 Curves for expected responses (a) with interaction and (b) without interaction. Each panel plots expected response against the level of factor 2 (1 to 4), with one curve for each of levels 1, 2, and 3 of factor 1.]

In a manner analogous to (6-55), each observation can be decomposed as


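$$x_{\ell k r} = \bar{x} + (\bar{x}_{\ell\cdot} - \bar{x}) + (\bar{x}_{\cdot k} - \bar{x}) + (\bar{x}_{\ell k} - \bar{x}_{\ell\cdot} - \bar{x}_{\cdot k} + \bar{x}) + (x_{\ell k r} - \bar{x}_{\ell k}) \tag{6-56}$$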


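where $\bar{x}$ is the overall average, $\bar{x}_{\ell\cdot}$ is the average for the $\ell$th level of factor 1, $\bar{x}_{\cdot k}$ is
the average for the $k$th level of factor 2, and $\bar{x}_{\ell k}$ is the average for the $\ell$th level of
factor 1 and the $k$th level of factor 2. Squaring and summing the deviations
$(x_{\ell k r} - \bar{x})$ gives

$$\begin{aligned}
\sum_{\ell=1}^{g}\sum_{k=1}^{b}\sum_{r=1}^{n} (x_{\ell k r} - \bar{x})^2 &= \sum_{\ell=1}^{g} bn(\bar{x}_{\ell\cdot} - \bar{x})^2 + \sum_{k=1}^{b} gn(\bar{x}_{\cdot k} - \bar{x})^2 \\
&\quad + \sum_{\ell=1}^{g}\sum_{k=1}^{b} n(\bar{x}_{\ell k} - \bar{x}_{\ell\cdot} - \bar{x}_{\cdot k} + \bar{x})^2 \\
&\quad + \sum_{\ell=1}^{g}\sum_{k=1}^{b}\sum_{r=1}^{n} (x_{\ell k r} - \bar{x}_{\ell k})^2
\end{aligned} \tag{6-57}$$

or

$$\text{SS}_{\text{cor}} = \text{SS}_{\text{fac1}} + \text{SS}_{\text{fac2}} + \text{SS}_{\text{int}} + \text{SS}_{\text{res}}$$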

The corresponding degrees of freedom associated with the sums of squares in the 
breakup in (6-57) are 


$$gbn - 1 = (g - 1) + (b - 1) + (g - 1)(b - 1) + gb(n - 1) \tag{6-58}$$


The ANOVA table takes the following form: 


ANOVA Table for Comparing Effects of Two Factors and Their Interaction 


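Source of variation     Sum of squares (SS)                                                                                                                                   Degrees of freedom (d.f.)

Factor 1                $\text{SS}_{\text{fac1}} = \sum_{\ell=1}^{g} bn(\bar{x}_{\ell\cdot} - \bar{x})^2$                                                                    $g - 1$

Factor 2                $\text{SS}_{\text{fac2}} = \sum_{k=1}^{b} gn(\bar{x}_{\cdot k} - \bar{x})^2$                                                                         $b - 1$

Interaction             $\text{SS}_{\text{int}} = \sum_{\ell=1}^{g}\sum_{k=1}^{b} n(\bar{x}_{\ell k} - \bar{x}_{\ell\cdot} - \bar{x}_{\cdot k} + \bar{x})^2$              $(g - 1)(b - 1)$

Residual (Error)        $\text{SS}_{\text{res}} = \sum_{\ell=1}^{g}\sum_{k=1}^{b}\sum_{r=1}^{n} (x_{\ell k r} - \bar{x}_{\ell k})^2$                                       $gb(n - 1)$

Total (corrected)       $\text{SS}_{\text{cor}} = \sum_{\ell=1}^{g}\sum_{k=1}^{b}\sum_{r=1}^{n} (x_{\ell k r} - \bar{x})^2$                                                $gbn - 1$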




The F-ratios of the mean squares, $\text{SS}_{\text{fac1}}/(g - 1)$, $\text{SS}_{\text{fac2}}/(b - 1)$, and
$\text{SS}_{\text{int}}/(g - 1)(b - 1)$ to the mean square, $\text{SS}_{\text{res}}/(gb(n - 1))$ can be used to test for
the effects of factor 1, factor 2, and factor 1-factor 2 interaction, respectively. (See
[11] for a discussion of univariate two-way analysis of variance.)
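
To see the breakup (6-57) and the F-ratios at work, the following sketch uses a small balanced data set. The numbers are hypothetical and chosen only for illustration ($g = 2$, $b = 3$, $n = 2$); they do not come from the text, and the variable names are assumptions.

```python
# Hypothetical balanced layout: g = 2 levels of factor 1, b = 3 levels of factor 2,
# n = 2 replicates per cell. data[l][k] holds the n observations for cell (l, k).
data = [
    [[4.0, 6.0], [7.0, 9.0], [5.0, 5.0]],
    [[2.0, 4.0], [8.0, 6.0], [6.0, 4.0]],
]
g, b, n = len(data), len(data[0]), len(data[0][0])

cell = [[sum(obs) / n for obs in row] for row in data]             # cell means
fac1 = [sum(cell[l]) / b for l in range(g)]                        # factor 1 level means
fac2 = [sum(cell[l][k] for l in range(g)) / g for k in range(b)]   # factor 2 level means
grand = sum(fac1) / g                                              # overall mean

# Sums of squares in the breakup (6-57)
ss_fac1 = sum(b * n * (m - grand) ** 2 for m in fac1)
ss_fac2 = sum(g * n * (m - grand) ** 2 for m in fac2)
ss_int = sum(
    n * (cell[l][k] - fac1[l] - fac2[k] + grand) ** 2
    for l in range(g)
    for k in range(b)
)
ss_res = sum(
    (x - cell[l][k]) ** 2 for l in range(g) for k in range(b) for x in data[l][k]
)
ss_cor = sum((x - grand) ** 2 for row in data for obs in row for x in obs)

# Degrees of freedom (6-58)
df = {"fac1": g - 1, "fac2": b - 1, "int": (g - 1) * (b - 1), "res": g * b * (n - 1)}

# F-ratios: each mean square divided by the residual mean square.
ms_res = ss_res / df["res"]
f_fac1 = (ss_fac1 / df["fac1"]) / ms_res
f_fac2 = (ss_fac2 / df["fac2"]) / ms_res
f_int = (ss_int / df["int"]) / ms_res

print(ss_fac1, ss_fac2, ss_int, ss_res, ss_cor, f_fac1, f_fac2, f_int)
```

The simple averaging used for the level means relies on the design being balanced (the same $n$ in every cell), as the model (6-54) assumes.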


Multivariate Two-Way Fixed-Effects Model with Interaction 


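Proceeding by analogy, we specify the two-way fixed-effects model for a vector
response consisting of $p$ components [see (6-54)]

$$\mathbf{X}_{\ell k r} = \boldsymbol{\mu} + \boldsymbol{\tau}_\ell + \boldsymbol{\beta}_k + \boldsymbol{\gamma}_{\ell k} + \mathbf{e}_{\ell k r}, \qquad \ell = 1, 2, \dots, g; \quad k = 1, 2, \dots, b; \quad r = 1, 2, \dots, n \tag{6-59}$$

where $\sum_{\ell=1}^{g} \boldsymbol{\tau}_\ell = \sum_{k=1}^{b} \boldsymbol{\beta}_k = \sum_{\ell=1}^{g} \boldsymbol{\gamma}_{\ell k} = \sum_{k=1}^{b} \boldsymbol{\gamma}_{\ell k} = \mathbf{0}$. The vectors are all of order $p \times 1$,
and the $\mathbf{e}_{\ell k r}$ are independent $N_p(\mathbf{0}, \boldsymbol{\Sigma})$ random vectors. Thus, the responses consist
of $p$ measurements replicated $n$ times at each of the possible combinations of levels
of factors 1 and 2.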

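Following (6-56), we can decompose the observation vectors $\mathbf{x}_{\ell k r}$ as

$$\mathbf{x}_{\ell k r} = \bar{\mathbf{x}} + (\bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}}) + (\bar{\mathbf{x}}_{\cdot k} - \bar{\mathbf{x}}) + (\bar{\mathbf{x}}_{\ell k} - \bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}}_{\cdot k} + \bar{\mathbf{x}}) + (\mathbf{x}_{\ell k r} - \bar{\mathbf{x}}_{\ell k}) \tag{6-60}$$

where $\bar{\mathbf{x}}$ is the overall average of the observation vectors, $\bar{\mathbf{x}}_{\ell\cdot}$ is the average of the
observation vectors at the $\ell$th level of factor 1, $\bar{\mathbf{x}}_{\cdot k}$ is the average of the observation
vectors at the $k$th level of factor 2, and $\bar{\mathbf{x}}_{\ell k}$ is the average of the observation vectors
at the $\ell$th level of factor 1 and the $k$th level of factor 2.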

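Straightforward generalizations of (6-57) and (6-58) give the breakups of the
sum of squares and cross products and degrees of freedom:

$$\begin{aligned}
\sum_{\ell=1}^{g}\sum_{k=1}^{b}\sum_{r=1}^{n} (\mathbf{x}_{\ell k r} - \bar{\mathbf{x}})(\mathbf{x}_{\ell k r} - \bar{\mathbf{x}})' &= \sum_{\ell=1}^{g} bn(\bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}})(\bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}})' \\
&\quad + \sum_{k=1}^{b} gn(\bar{\mathbf{x}}_{\cdot k} - \bar{\mathbf{x}})(\bar{\mathbf{x}}_{\cdot k} - \bar{\mathbf{x}})' \\
&\quad + \sum_{\ell=1}^{g}\sum_{k=1}^{b} n(\bar{\mathbf{x}}_{\ell k} - \bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}}_{\cdot k} + \bar{\mathbf{x}})(\bar{\mathbf{x}}_{\ell k} - \bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}}_{\cdot k} + \bar{\mathbf{x}})' \\
&\quad + \sum_{\ell=1}^{g}\sum_{k=1}^{b}\sum_{r=1}^{n} (\mathbf{x}_{\ell k r} - \bar{\mathbf{x}}_{\ell k})(\mathbf{x}_{\ell k r} - \bar{\mathbf{x}}_{\ell k})'
\end{aligned} \tag{6-61}$$

$$gbn - 1 = (g - 1) + (b - 1) + (g - 1)(b - 1) + gb(n - 1) \tag{6-62}$$

Again, the generalization from the univariate to the multivariate analysis consists
simply of replacing a scalar such as $(\bar{x}_{\ell\cdot} - \bar{x})^2$ with the corresponding matrix
$(\bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}})(\bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}})'$.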




The MANOVA table is the following: 


MANOVA Table for Comparing Factors and Their Interaction 
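
Source of variation     Matrix of sum of squares and cross products (SSP)                                                                                                                                                                                        Degrees of freedom (d.f.)

Factor 1                $\text{SSP}_{\text{fac1}} = \sum_{\ell=1}^{g} bn(\bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}})(\bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}})'$                                                                                      $g - 1$

Factor 2                $\text{SSP}_{\text{fac2}} = \sum_{k=1}^{b} gn(\bar{\mathbf{x}}_{\cdot k} - \bar{\mathbf{x}})(\bar{\mathbf{x}}_{\cdot k} - \bar{\mathbf{x}})'$                                                                                            $b - 1$

Interaction             $\text{SSP}_{\text{int}} = \sum_{\ell=1}^{g}\sum_{k=1}^{b} n(\bar{\mathbf{x}}_{\ell k} - \bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}}_{\cdot k} + \bar{\mathbf{x}})(\bar{\mathbf{x}}_{\ell k} - \bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}}_{\cdot k} + \bar{\mathbf{x}})'$   $(g - 1)(b - 1)$

Residual (Error)        $\text{SSP}_{\text{res}} = \sum_{\ell=1}^{g}\sum_{k=1}^{b}\sum_{r=1}^{n} (\mathbf{x}_{\ell k r} - \bar{\mathbf{x}}_{\ell k})(\mathbf{x}_{\ell k r} - \bar{\mathbf{x}}_{\ell k})'$                                                        $gb(n - 1)$

Total (corrected)       $\text{SSP}_{\text{cor}} = \sum_{\ell=1}^{g}\sum_{k=1}^{b}\sum_{r=1}^{n} (\mathbf{x}_{\ell k r} - \bar{\mathbf{x}})(\mathbf{x}_{\ell k r} - \bar{\mathbf{x}})'$                                                                          $gbn - 1$
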


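A test (the likelihood ratio test)⁵ of

$$H_0\colon \boldsymbol{\gamma}_{11} = \boldsymbol{\gamma}_{12} = \cdots = \boldsymbol{\gamma}_{gb} = \mathbf{0} \quad \text{(no interaction effects)} \tag{6-63}$$

versus

$$H_1\colon \text{At least one } \boldsymbol{\gamma}_{\ell k} \neq \mathbf{0}$$

is conducted by rejecting $H_0$ for small values of the ratio

$$\Lambda^* = \frac{|\text{SSP}_{\text{res}}|}{|\text{SSP}_{\text{int}} + \text{SSP}_{\text{res}}|} \tag{6-64}$$

For large samples, Wilks' lambda, $\Lambda^*$, can be referred to a chi-square percentile.
Using Bartlett's multiplier (see [6]) to improve the chi-square approximation, we
reject $H_0\colon \boldsymbol{\gamma}_{11} = \boldsymbol{\gamma}_{12} = \cdots = \boldsymbol{\gamma}_{gb} = \mathbf{0}$ at the $\alpha$ level if

$$-\left[gb(n - 1) - \frac{p + 1 - (g - 1)(b - 1)}{2}\right]\ln \Lambda^* > \chi^2_{(g-1)(b-1)p}(\alpha) \tag{6-65}$$

where $\Lambda^*$ is given by (6-64) and $\chi^2_{(g-1)(b-1)p}(\alpha)$ is the upper $(100\alpha)$th percentile of a
chi-square distribution with $(g - 1)(b - 1)p$ d.f.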

Ordinarily, the test for interaction is carried out before the tests for main factor 
fects. If interaction effects exist, the factor effects do not have a clear interpretation: 
From a practical standpoint, it is not advisable to proceed with the additional muld 
variate tests. Instead, p univariate two-way analyses of variance (one for each varial 
are often conducted to see whether the interaction appears in some responses bu 


⁵The likelihood test procedures require that $p \le gb(n - 1)$, so that $\mathrm{SSP}_{\mathrm{res}}$ will be positive definite
(with probability 1).


Two-Way Multivariate Analysis of Variance 317 


others. Those responses without interaction may be interpreted in terms of additive 
factor 1 and 2 effects, provided that the latter effects exist. In any event, interaction 
plots similar to Figure 6.3, but with treatment sample means replacing expected values, 
best clarify the relative magnitudes of the main and interaction effects. 

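In the multivariate model, we test for factor 1 and factor 2 main effects as
follows. First, consider the hypotheses $H_0\colon \boldsymbol{\tau}_1 = \boldsymbol{\tau}_2 = \cdots = \boldsymbol{\tau}_g = \mathbf{0}$ and $H_1$: at least
one $\boldsymbol{\tau}_\ell \ne \mathbf{0}$. These hypotheses specify no factor 1 effects and some factor 1 effects,
respectively. Let

$$\Lambda^* = \frac{|\mathrm{SSP}_{\mathrm{res}}|}{|\mathrm{SSP}_{\mathrm{fac1}} + \mathrm{SSP}_{\mathrm{res}}|} \tag{6-66}$$

so that small values of $\Lambda^*$ are consistent with $H_1$. Using Bartlett’s correction, the
likelihood ratio test is as follows:

Reject $H_0\colon \boldsymbol{\tau}_1 = \boldsymbol{\tau}_2 = \cdots = \boldsymbol{\tau}_g = \mathbf{0}$ (no factor 1 effects) at level $\alpha$ if

$$-\left[gb(n - 1) - \frac{p + 1 - (g - 1)}{2}\right] \ln \Lambda^* > \chi^2_{(g-1)p}(\alpha) \tag{6-67}$$

where $\Lambda^*$ is given by (6-66) and $\chi^2_{(g-1)p}(\alpha)$ is the upper $(100\alpha)$th percentile of a
chi-square distribution with $(g - 1)p$ d.f.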

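In a similar manner, factor 2 effects are tested by considering $H_0\colon \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \cdots = \boldsymbol{\beta}_b = \mathbf{0}$
and $H_1$: at least one $\boldsymbol{\beta}_k \ne \mathbf{0}$. Small values of

$$\Lambda^* = \frac{|\mathrm{SSP}_{\mathrm{res}}|}{|\mathrm{SSP}_{\mathrm{fac2}} + \mathrm{SSP}_{\mathrm{res}}|} \tag{6-68}$$

are consistent with $H_1$. Once again, for large samples and using Bartlett’s correction:
Reject $H_0\colon \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \cdots = \boldsymbol{\beta}_b = \mathbf{0}$ (no factor 2 effects) at level $\alpha$ if

$$-\left[gb(n - 1) - \frac{p + 1 - (b - 1)}{2}\right] \ln \Lambda^* > \chi^2_{(b-1)p}(\alpha) \tag{6-69}$$

where $\Lambda^*$ is given by (6-68) and $\chi^2_{(b-1)p}(\alpha)$ is the upper $(100\alpha)$th percentile of a
chi-square distribution with $(b - 1)p$ degrees of freedom.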

Simultaneous confidence intervals for contrasts in the model parameters
can provide insights into the nature of the factor effects. Results comparable to
Result 6.5 are available for the two-way model. When interaction effects are
negligible, we may concentrate on contrasts in the factor 1 and factor 2 main
effects. The Bonferroni approach applies to the components of the differences
$\boldsymbol{\tau}_\ell - \boldsymbol{\tau}_m$ of the factor 1 effects and the components of $\boldsymbol{\beta}_k - \boldsymbol{\beta}_q$ of the factor 2
effects, respectively.

The $100(1 - \alpha)\%$ simultaneous confidence intervals for $\tau_{\ell i} - \tau_{m i}$ are

$$\tau_{\ell i} - \tau_{m i} \ \text{belongs to}\ (\bar{x}_{\ell\cdot i} - \bar{x}_{m\cdot i}) \pm t_\nu\!\left(\frac{\alpha}{pg(g - 1)}\right)\sqrt{\frac{E_{ii}}{\nu}\,\frac{2}{bn}} \tag{6-70}$$

where $\nu = gb(n - 1)$, $E_{ii}$ is the $i$th diagonal element of $\mathbf{E} = \mathrm{SSP}_{\mathrm{res}}$, and $\bar{x}_{\ell\cdot i} - \bar{x}_{m\cdot i}$
is the $i$th component of $\bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}}_{m\cdot}$.




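Similarly, the $100(1 - \alpha)$ percent simultaneous confidence intervals for $\beta_{k i} - \beta_{q i}$
are

$$\beta_{k i} - \beta_{q i} \ \text{belongs to}\ (\bar{x}_{\cdot k i} - \bar{x}_{\cdot q i}) \pm t_\nu\!\left(\frac{\alpha}{pb(b - 1)}\right)\sqrt{\frac{E_{ii}}{\nu}\,\frac{2}{gn}} \tag{6-71}$$

where $\nu$ and $E_{ii}$ are as just defined and $\bar{x}_{\cdot k i} - \bar{x}_{\cdot q i}$ is the $i$th component of $\bar{\mathbf{x}}_{\cdot k} - \bar{\mathbf{x}}_{\cdot q}$.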


Comment. We have considered the multivariate two-way model with replications.
That is, the model allows for n replications of the responses at each combination
of factor levels. This enables us to examine the “interaction” of the factors. If
only one observation vector is available at each combination of factor levels, the
two-way model does not allow for the possibility of a general interaction term $\boldsymbol{\gamma}_{\ell k}$.
The corresponding MANOVA table includes only factor 1, factor 2, and residual
sources of variation as components of the total variation. (See Exercise 6.13.)


Example 6.13 (A two-way multivariate analysis of variance of plastic film data) The
optimum conditions for extruding plastic film have been examined using a technique
called Evolutionary Operation. (See [9].) In the course of the study that was
done, three responses ($X_1$ = tear resistance, $X_2$ = gloss, and $X_3$ = opacity) were
measured at two levels of the factors, rate of extrusion and amount of an additive.
The measurements were repeated n = 5 times at each combination of the factor
levels. The data are displayed in Table 6.4.


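Table 6.4 Plastic Film Data

$x_1$ = tear resistance, $x_2$ = gloss, and $x_3$ = opacity

                                          Factor 2: Amount of additive
                                    Low (1.0%)                High (1.5%)
                                    x1    x2    x3            x1    x2    x3

                                   [6.5   9.5   4.4]         [6.9   9.1   5.7]
                                   [6.2   9.9   6.4]         [7.2  10.0   2.0]
                    Low (-10%)     [5.8   9.6   3.0]         [6.9   9.9   3.9]
                                   [6.5   9.6   4.1]         [6.1   9.5   1.9]
Factor 1: Change                   [6.5   9.2   0.8]         [6.3   9.4   5.7]
in rate of extrusion
                                    x1    x2    x3            x1    x2    x3

                                   [6.7   9.1   2.8]         [7.1   9.2   8.4]
                                   [6.6   9.3   4.1]         [7.0   8.8   5.2]
                    High (10%)     [7.2   8.3   3.8]         [7.2   9.7   6.9]
                                   [7.1   8.4   1.6]         [7.5  10.1   2.7]
                                   [6.8   8.5   3.4]         [7.6   9.2   1.9]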

The matrices of the appropriate sum of squares and cross products were calculated
(see the SAS statistical software output in Panel 6.1⁶), leading to the following
MANOVA table:


⁶Additional SAS programs for MANOVA and other procedures discussed in this chapter are
available in [13].




Source of variation              SSP (symmetric; upper triangle shown)       d.f.

                                 [ 1.7405   -1.5045     .8555 ]
Factor 1: change in rate         [           1.3005    -.7395 ]               1
of extrusion                     [                      .4205 ]

                                 [  .7605     .6825    1.9305 ]
Factor 2: amount of              [            .6125    1.7325 ]               1
additive                         [                     4.9005 ]

                                 [  .0005     .0165     .0445 ]
Interaction                      [            .5445    1.4685 ]               1
                                 [                     3.9605 ]

                                 [ 1.7640     .0200   -3.0700 ]
Residual                         [           2.6280    -.5520 ]              16
                                 [                    64.9240 ]

                                 [ 4.2655    -.7855    -.2395 ]
Total (corrected)                [           5.0855    1.9095 ]              19
                                 [                    74.2055 ]
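
The decomposition behind the MANOVA table can be checked directly from the raw observations. The following is a minimal sketch in plain Python, assuming the 20 observation vectors of the plastic film experiment are keyed by their factor levels; the dictionary layout and the helper names (`mean_vec`, `add_outer`, `zeros`) are illustrative choices, not part of the SAS analysis.

```python
# Plastic film data, keyed by (rate of extrusion, amount of additive).
# Each observation is [tear resistance, gloss, opacity].
cells = {
    ("low", "low"): [[6.5, 9.5, 4.4], [6.2, 9.9, 6.4], [5.8, 9.6, 3.0],
                     [6.5, 9.6, 4.1], [6.5, 9.2, 0.8]],
    ("low", "high"): [[6.9, 9.1, 5.7], [7.2, 10.0, 2.0], [6.9, 9.9, 3.9],
                      [6.1, 9.5, 1.9], [6.3, 9.4, 5.7]],
    ("high", "low"): [[6.7, 9.1, 2.8], [6.6, 9.3, 4.1], [7.2, 8.3, 3.8],
                      [7.1, 8.4, 1.6], [6.8, 8.5, 3.4]],
    ("high", "high"): [[7.1, 9.2, 8.4], [7.0, 8.8, 5.2], [7.2, 9.7, 6.9],
                       [7.5, 10.1, 2.7], [7.6, 9.2, 1.9]],
}
levels = ["low", "high"]
g, b, n, p = 2, 2, 5, 3  # factor levels, replications, number of responses


def mean_vec(rows):
    """Componentwise mean of a list of p-vectors."""
    return [sum(r[i] for r in rows) / len(rows) for i in range(p)]


def add_outer(acc, d, weight=1.0):
    """Add weight * d d' to the p x p matrix acc, in place."""
    for i in range(p):
        for j in range(p):
            acc[i][j] += weight * d[i] * d[j]


def zeros():
    return [[0.0] * p for _ in range(p)]


grand = mean_vec([r for rows in cells.values() for r in rows])
row_mean = {l: mean_vec(cells[(l, "low")] + cells[(l, "high")]) for l in levels}
col_mean = {k: mean_vec(cells[("low", k)] + cells[("high", k)]) for k in levels}
cell_mean = {key: mean_vec(rows) for key, rows in cells.items()}

ssp_fac1, ssp_fac2, ssp_int, ssp_res, ssp_cor = (zeros() for _ in range(5))
for l in levels:
    add_outer(ssp_fac1, [row_mean[l][i] - grand[i] for i in range(p)], b * n)
for k in levels:
    add_outer(ssp_fac2, [col_mean[k][i] - grand[i] for i in range(p)], g * n)
for (l, k), rows in cells.items():
    d = [cell_mean[(l, k)][i] - row_mean[l][i] - col_mean[k][i] + grand[i]
         for i in range(p)]
    add_outer(ssp_int, d, n)
    for r in rows:
        add_outer(ssp_res, [r[i] - cell_mean[(l, k)][i] for i in range(p)])
        add_outer(ssp_cor, [r[i] - grand[i] for i in range(p)])

for name, m in [("fac1", ssp_fac1), ("fac2", ssp_fac2), ("int", ssp_int),
                ("res", ssp_res), ("cor", ssp_cor)]:
    print(name, [[round(v, 4) for v in row] for row in m])
```

The four component matrices add up to the corrected total, which is the matrix analogue of the degrees-of-freedom breakup in (6-62).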


PANEL 6.1 SAS ANALYSIS FOR EXAMPLE 6.13 USING PROC GLM


PROGRAM COMMANDS

title 'MANOVA';
data film;
infile 'T6-4.dat';
input x1 x2 x3 factor1 factor2;
proc glm data = film;
class factor1 factor2;
model x1 x2 x3 = factor1 factor2 factor1*factor2 /ss3;
manova h = factor1 factor2 factor1*factor2 /printe;
means factor1 factor2;


OUTPUT

General Linear Models Procedure
Class Level Information

Class       Levels    Values
FACTOR1     2         0 1
FACTOR2     2         0 1

Number of Observations in data set = 20


Dependent Variable: X1

Source              DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                3    2.50150000        0.83383333     7.56       0.0023
Error               16    1.76400000        0.11025000
Corrected Total     19    4.26550000

R-Square    C.V.        Root MSE    X1 Mean
0.586449    4.893724    0.332039    6.78500000

Source              DF    Type III SS    Mean Square    F Value    Pr > F
FACTOR1              1    1.74050000     1.74050000     15.79      0.0011
FACTOR2              1    0.76050000     0.76050000      6.90      0.0183
FACTOR1*FACTOR2      1    0.00050000     0.00050000      0.00      0.9471




Dependent Variable: X2

Source              DF    Sum of Squares    Mean Square
Model                3    2.45750000        0.81916667
Error               16    2.62800000        0.16425000
Corrected Total     19    5.08550000

R-Square    C.V.        Root MSE
0.483237    4.350807    0.405278

Source              Type III SS    Mean Square    F Value
FACTOR1             1.30050000     1.30050000     7.92
FACTOR2             0.61250000     0.61250000     3.73
FACTOR1*FACTOR2     0.54450000     0.54450000     3.32


Dependent Variable: X3

Source              DF    Sum of Squares    Mean Square
Model                3    9.28150000        3.09383333
Error               16    64.92400000       4.05775000
Corrected Total     19    74.20550000

R-Square    C.V.        Root MSE    X3 Mean
0.125078    51.19151    2.014386    3.93500000

Source              Type III SS    Mean Square    Pr > F
FACTOR1             0.42050000     0.42050000     0.7517
FACTOR2             4.90050000     4.90050000     0.2881
FACTOR1*FACTOR2     3.96050000     3.96050000     0.3379


E = Error SS&CP Matrix

         X1        X2        X3
X1     1.764     0.02     -3.07
X2     0.02      2.628    -0.552
X3    -3.07     -0.552    64.924


Manova Test Criteria and Exact F Statistics for
the Hypothesis of no Overall FACTOR1 Effect

H = Type III SS&CP Matrix for FACTOR1    E = Error SS&CP Matrix
S=1  M=0.5  N=6

Statistic                  Value         F        Num DF    Den DF    Pr > F
Wilks' Lambda              0.38185838    7.5543   3         14        0.0030
Pillai's Trace             0.61814162    7.5543   3         14        0.0030
Hotelling-Lawley Trace     1.61877188    7.5543   3         14        0.0030
Roy's Greatest Root        1.61877188    7.5543   3         14        0.0030




Manova Test Criteria and Exact F Statistics for
the Hypothesis of no Overall FACTOR2 Effect

H = Type III SS&CP Matrix for FACTOR2    E = Error SS&CP Matrix
S=1  M=0.5  N=6

Statistic                  Value         F        Num DF    Den DF    Pr > F
Wilks' Lambda              0.52303490    4.2556   3         14        0.0247
Pillai's Trace             0.47696510    4.2556   3         14        0.0247
Hotelling-Lawley Trace     0.91191832    4.2556   3         14        0.0247
Roy's Greatest Root        0.91191832    4.2556   3         14        0.0247


Manova Test Criteria and Exact F Statistics for
the Hypothesis of no Overall FACTOR1*FACTOR2 Effect

H = Type III SS&CP Matrix for FACTOR1*FACTOR2    E = Error SS&CP Matrix
S=1  M=0.5  N=6

Statistic                  Value         F        Num DF    Den DF    Pr > F
Wilks' Lambda              0.77710576    1.3385   3         14        0.3018
Pillai's Trace             0.22289424    1.3385   3         14        0.3018
Hotelling-Lawley Trace     0.28682614    1.3385   3         14        0.3018
Roy's Greatest Root        0.28682614    1.3385   3         14        0.3018

Level of          ---------X1---------          ---------X2---------
FACTOR1    N      Mean          SD              Mean          SD
0          10     6.49000000    0.42018514      9.57000000    0.29832868
1          10     7.08000000    0.32249031      9.06000000    0.57580861

Level of          ---------X3---------
FACTOR1    N      Mean          SD
0          10     3.79000000    1.85379491
1          10     4.08000000    2.18214981

Level of          ---------X1---------          ---------X2---------
FACTOR2    N      Mean          SD              Mean          SD
0          10     6.59000000    0.40674863      9.14000000    0.56015871
1          10     6.98000000    0.47328638      9.49000000    0.42804465

Level of          ---------X3---------
FACTOR2    N      Mean          SD
0          10     3.44000000    1.55077042
1          10     4.43000000    2.30123155


To test for interaction, we compute

$$\Lambda^* = \frac{|\mathrm{SSP}_{\mathrm{res}}|}{|\mathrm{SSP}_{\mathrm{int}} + \mathrm{SSP}_{\mathrm{res}}|} = \frac{275.7098}{354.7906} = .7771$$


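For $(g - 1)(b - 1) = 1$,

$$F = \left(\frac{1 - \Lambda^*}{\Lambda^*}\right)\frac{(gb(n - 1) - p + 1)/2}{(|(g - 1)(b - 1) - p| + 1)/2}$$

has an exact F-distribution with $\nu_1 = |(g - 1)(b - 1) - p| + 1$ and
$\nu_2 = gb(n - 1) - p + 1$ d.f. (See [1].) For our example,

$$F = \left(\frac{1 - .7771}{.7771}\right)\frac{(2(2)(4) - 3 + 1)/2}{(|1(1) - 3| + 1)/2} = 1.34$$

$$\nu_1 = (|1(1) - 3| + 1) = 3$$
$$\nu_2 = (2(2)(4) - 3 + 1) = 14$$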


and $F_{3,14}(.05) = 3.34$. Since $F = 1.34 < F_{3,14}(.05) = 3.34$, we do not reject the
hypothesis $H_0\colon \boldsymbol{\gamma}_{11} = \boldsymbol{\gamma}_{12} = \boldsymbol{\gamma}_{21} = \boldsymbol{\gamma}_{22} = \mathbf{0}$ (no interaction effects).

Note that the approximate chi-square statistic for this test is
$-[2(2)(4) - (3 + 1 - 1(1))/2] \ln(.7771) = 3.66$, from (6-65). Since $\chi^2_3(.05) = 7.81$, we would
reach the same conclusion as provided by the exact F-test.
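
The interaction test can be reproduced numerically. The sketch below, in plain Python, assumes the residual and interaction SSP matrices of Example 6.13 entered as full symmetric 3 × 3 arrays; `det3` is an illustrative helper for a 3 × 3 determinant, not a library routine.

```python
import math


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# SSP matrices from the MANOVA table of Example 6.13 (symmetric).
ssp_res = [[1.7640, 0.0200, -3.0700],
           [0.0200, 2.6280, -0.5520],
           [-3.0700, -0.5520, 64.9240]]
ssp_int = [[0.0005, 0.0165, 0.0445],
           [0.0165, 0.5445, 1.4685],
           [0.0445, 1.4685, 3.9605]]

g, b, n, p = 2, 2, 5, 3

ssp_sum = [[ssp_int[i][j] + ssp_res[i][j] for j in range(3)] for i in range(3)]
wilks = det3(ssp_res) / det3(ssp_sum)                      # (6-64)

# Exact F statistic, valid here because (g - 1)(b - 1) = 1.
nu1 = abs((g - 1) * (b - 1) - p) + 1
nu2 = g * b * (n - 1) - p + 1
f_stat = (1 - wilks) / wilks * (nu2 / 2) / (nu1 / 2)

# Bartlett-corrected chi-square statistic, (6-65).
chi2_stat = -(g * b * (n - 1) - (p + 1 - (g - 1) * (b - 1)) / 2) * math.log(wilks)

print(round(wilks, 4), round(f_stat, 2), nu1, nu2, round(chi2_stat, 2))
```

Both statistics lead to the same decision: the exact F is compared with $F_{3,14}(.05) = 3.34$ and the chi-square statistic with $\chi^2_3(.05) = 7.81$.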

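To test for factor 1 and factor 2 effects (see page 317), we calculate

$$\Lambda_1^* = \frac{|\mathrm{SSP}_{\mathrm{res}}|}{|\mathrm{SSP}_{\mathrm{fac1}} + \mathrm{SSP}_{\mathrm{res}}|} = \frac{275.7098}{722.0212} = .3819$$

and

$$\Lambda_2^* = \frac{|\mathrm{SSP}_{\mathrm{res}}|}{|\mathrm{SSP}_{\mathrm{fac2}} + \mathrm{SSP}_{\mathrm{res}}|} = \frac{275.7098}{527.1347} = .5230$$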


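For both $g - 1 = 1$ and $b - 1 = 1$,

$$F_1 = \left(\frac{1 - \Lambda_1^*}{\Lambda_1^*}\right)\frac{(gb(n - 1) - p + 1)/2}{(|(g - 1) - p| + 1)/2}$$

and

$$F_2 = \left(\frac{1 - \Lambda_2^*}{\Lambda_2^*}\right)\frac{(gb(n - 1) - p + 1)/2}{(|(b - 1) - p| + 1)/2}$$

have F-distributions with degrees of freedom $\nu_1 = |(g - 1) - p| + 1$, $\nu_2 = gb(n - 1) - p + 1$
and $\nu_1 = |(b - 1) - p| + 1$, $\nu_2 = gb(n - 1) - p + 1$, respectively.
(See [1].) In our case,

$$F_1 = \left(\frac{1 - .3819}{.3819}\right)\frac{(16 - 3 + 1)/2}{(|1 - 3| + 1)/2} = 7.55$$

$$F_2 = \left(\frac{1 - .5230}{.5230}\right)\frac{(16 - 3 + 1)/2}{(|1 - 3| + 1)/2} = 4.26$$

and

$$\nu_1 = |1 - 3| + 1 = 3 \qquad \nu_2 = (16 - 3 + 1) = 14$$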


Profile Analysis 323 


From before, $F_{3,14}(.05) = 3.34$. We have $F_1 = 7.55 > F_{3,14}(.05) = 3.34$, and
therefore, we reject $H_0\colon \boldsymbol{\tau}_1 = \boldsymbol{\tau}_2 = \mathbf{0}$ (no factor 1 effects) at the 5% level. Similarly,
$F_2 = 4.26 > F_{3,14}(.05) = 3.34$, and we reject $H_0\colon \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \mathbf{0}$ (no factor 2 effects)
at the 5% level. We conclude that both the change in rate of extrusion and the amount
of additive affect the responses, and they do so in an additive manner.

The nature of the effects of factors 1 and 2 on the responses is explored in Exercise 6.15.
In that exercise, simultaneous confidence intervals for contrasts in the
components of $\boldsymbol{\tau}_\ell$ and $\boldsymbol{\beta}_k$ are considered. ■
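
The two main-effect tests of this example use the same conversion from Wilks' lambda to an exact F statistic. A minimal Python sketch follows, assuming the determinants quoted in the example; the function name `exact_f_from_wilks` is hypothetical and the conversion applies only when the hypothesis degrees of freedom equal 1, as here.

```python
def exact_f_from_wilks(wilks, hyp_df, err_df, p):
    """Exact F statistic and its d.f. for Wilks' lambda when hyp_df = 1.

    hyp_df : degrees of freedom of the hypothesis SSP matrix (g - 1 or b - 1)
    err_df : degrees of freedom of the residual SSP matrix, gb(n - 1)
    p      : number of response variables
    """
    nu1 = abs(hyp_df - p) + 1
    nu2 = err_df - p + 1
    f_stat = (1 - wilks) / wilks * (nu2 / 2) / (nu1 / 2)
    return f_stat, nu1, nu2


det_res = 275.7098            # |SSP_res|
lam1 = det_res / 722.0212     # factor 1: |SSP_res| / |SSP_fac1 + SSP_res|
lam2 = det_res / 527.1347     # factor 2: |SSP_res| / |SSP_fac2 + SSP_res|

f1, nu1, nu2 = exact_f_from_wilks(lam1, hyp_df=1, err_df=16, p=3)
f2, _, _ = exact_f_from_wilks(lam2, hyp_df=1, err_df=16, p=3)

print(round(lam1, 4), round(f1, 2))   # compare with F_{3,14}(.05) = 3.34
print(round(lam2, 4), round(f2, 2))
```

The computed values agree with the Wilks' lambda rows of the SAS output in Panel 6.1 to the digits printed there.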


6.8 Profile Analysis 


Profile analysis pertains to situations in which a battery of p treatments (tests, ques- 
tions, and so forth) are administered to two or more groups of subjects. All responses 
must be expressed in similar units. Further, it is assumed that the responses for the 
different groups are independent of one another. Ordinarily, we might pose the 
question, are the population mean vectors the same? In profile analysis, the question 
of equality of mean vectors is divided into several specific possibilities. 

Consider the population means $\boldsymbol{\mu}_1' = [\mu_{11}, \mu_{12}, \mu_{13}, \mu_{14}]$ representing the average
responses to four treatments for the first group. A plot of these means, connected by
straight lines, is shown in Figure 6.4. This broken-line graph is the profile for population 1.

Profiles can be constructed for each population (group). We shall concentrate
on two groups. Let $\boldsymbol{\mu}_1' = [\mu_{11}, \mu_{12}, \ldots, \mu_{1p}]$ and $\boldsymbol{\mu}_2' = [\mu_{21}, \mu_{22}, \ldots, \mu_{2p}]$ be the
mean responses to p treatments for populations 1 and 2, respectively. The hypothesis
$H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ implies that the treatments have the same (average) effect on the two
populations. In terms of the population profiles, we can formulate the question of
equality in a stepwise fashion.


1. Are the profiles parallel?
Equivalently: Is $H_{01}\colon \mu_{1i} - \mu_{1i-1} = \mu_{2i} - \mu_{2i-1}$, $i = 2, 3, \ldots, p$, acceptable?
2. Assuming that the profiles are parallel, are the profiles coincident?⁷
Equivalently: Is $H_{02}\colon \mu_{1i} = \mu_{2i}$, $i = 1, 2, \ldots, p$, acceptable?


[Figure 6.4 The population profile p = 4. Vertical axis: mean response; horizontal axis: variable (1, 2, 3, 4).]


⁷The question, “Assuming that the profiles are parallel, are the profiles linear?” is considered in
Exercise 6.12. The null hypothesis of parallel linear profiles can be written $H_0\colon (\mu_{1i} + \mu_{2i})
- (\mu_{1i-1} + \mu_{2i-1}) = (\mu_{1i-1} + \mu_{2i-1}) - (\mu_{1i-2} + \mu_{2i-2})$, $i = 3, \ldots, p$. Although this hypothesis may be
of interest in a particular situation, in practice the question of whether two parallel profiles are the same
(coincident), whatever their nature, is usually of greater interest.




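3. Assuming that the profiles are coincident, are the profiles level? That is, are all
the means equal to the same constant?
Equivalently: Is $H_{03}\colon \mu_{11} = \mu_{12} = \cdots = \mu_{1p} = \mu_{21} = \mu_{22} = \cdots = \mu_{2p}$ acceptable?

The null hypothesis in stage 1 can be written

$$H_{01}\colon \mathbf{C}\boldsymbol{\mu}_1 = \mathbf{C}\boldsymbol{\mu}_2$$

where $\mathbf{C}$ is the contrast matrix

$$\underset{((p-1)\times p)}{\mathbf{C}} = \begin{bmatrix} -1 & 1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 1 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & -1 & 1 \end{bmatrix} \tag{6-72}$$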

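For independent samples of sizes $n_1$ and $n_2$ from the two populations, the null
hypothesis can be tested by constructing the transformed observations

$$\mathbf{C}\mathbf{x}_{1j}, \quad j = 1, 2, \ldots, n_1$$

and

$$\mathbf{C}\mathbf{x}_{2j}, \quad j = 1, 2, \ldots, n_2$$

These have sample mean vectors $\mathbf{C}\bar{\mathbf{x}}_1$ and $\mathbf{C}\bar{\mathbf{x}}_2$, respectively, and pooled covariance
matrix $\mathbf{C}\mathbf{S}_{\text{pooled}}\mathbf{C}'$.

Since the two sets of transformed observations have $N_{p-1}(\mathbf{C}\boldsymbol{\mu}_1, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}')$ and
$N_{p-1}(\mathbf{C}\boldsymbol{\mu}_2, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}')$ distributions, respectively, an application of Result 6.2 provides a
test for parallel profiles.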


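Test for Parallel Profiles for Two Normal Populations

Reject $H_{01}\colon \mathbf{C}\boldsymbol{\mu}_1 = \mathbf{C}\boldsymbol{\mu}_2$ (parallel profiles) at level $\alpha$ if

$$T^2 = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{C}'\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{C}\mathbf{S}_{\text{pooled}}\mathbf{C}'\right]^{-1}\mathbf{C}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) > c^2 \tag{6-73}$$

where

$$c^2 = \frac{(n_1 + n_2 - 2)(p - 1)}{n_1 + n_2 - p}\,F_{p-1,\,n_1+n_2-p}(\alpha)$$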


When the profiles are parallel, the first is either above the second ($\mu_{1i} > \mu_{2i}$,
for all i), or vice versa. Under this condition, the profiles will be coincident only if
the total heights $\mu_{11} + \mu_{12} + \cdots + \mu_{1p} = \mathbf{1}'\boldsymbol{\mu}_1$ and $\mu_{21} + \mu_{22} + \cdots + \mu_{2p} = \mathbf{1}'\boldsymbol{\mu}_2$
are equal. Therefore, the null hypothesis at stage 2 can be written in the equivalent
form

$$H_{02}\colon \mathbf{1}'\boldsymbol{\mu}_1 = \mathbf{1}'\boldsymbol{\mu}_2$$

We can then test $H_{02}$ with the usual two-sample t-statistic based on the univariate
observations $\mathbf{1}'\mathbf{x}_{1j}$, $j = 1, 2, \ldots, n_1$, and $\mathbf{1}'\mathbf{x}_{2j}$, $j = 1, 2, \ldots, n_2$.




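Test for Coincident Profiles, Given That Profiles Are Parallel

For two normal populations, reject $H_{02}\colon \mathbf{1}'\boldsymbol{\mu}_1 = \mathbf{1}'\boldsymbol{\mu}_2$ (profiles coincident) at
level $\alpha$ if

$$T^2 = \mathbf{1}'(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{1}'\mathbf{S}_{\text{pooled}}\mathbf{1}\right]^{-1}\mathbf{1}'(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)
= \left(\frac{\mathbf{1}'(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)}{\sqrt{\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)\mathbf{1}'\mathbf{S}_{\text{pooled}}\mathbf{1}}}\right)^2 > t^2_{n_1+n_2-2}\!\left(\frac{\alpha}{2}\right) = F_{1,\,n_1+n_2-2}(\alpha) \tag{6-74}$$
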
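For coincident profiles, $\mathbf{x}_{11}, \mathbf{x}_{12}, \ldots, \mathbf{x}_{1n_1}$ and $\mathbf{x}_{21}, \mathbf{x}_{22}, \ldots, \mathbf{x}_{2n_2}$ are all observations
from the same normal population. The next step is to see whether all variables
have the same mean, so that the common profile is level.

When $H_{01}$ and $H_{02}$ are tenable, the common mean vector $\boldsymbol{\mu}$ is estimated, using
all $n_1 + n_2$ observations, by

$$\bar{\mathbf{x}} = \frac{1}{n_1 + n_2}\left(\sum_{j=1}^{n_1}\mathbf{x}_{1j} + \sum_{j=1}^{n_2}\mathbf{x}_{2j}\right) = \frac{n_1}{(n_1 + n_2)}\bar{\mathbf{x}}_1 + \frac{n_2}{(n_1 + n_2)}\bar{\mathbf{x}}_2$$

If the common profile is level, then $\mu_1 = \mu_2 = \cdots = \mu_p$, and the null hypothesis at
stage 3 can be written as

$$H_{03}\colon \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$$

where $\mathbf{C}$ is given by (6-72). Consequently, we have the following test.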


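Test for Level Profiles, Given That Profiles Are Coincident

For two normal populations: Reject $H_{03}\colon \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ (profiles level) at level $\alpha$ if

$$(n_1 + n_2)\,\bar{\mathbf{x}}'\mathbf{C}'[\mathbf{C}\mathbf{S}\mathbf{C}']^{-1}\mathbf{C}\bar{\mathbf{x}} > c^2 \tag{6-75}$$

where $\mathbf{S}$ is the sample covariance matrix based on all $n_1 + n_2$ observations and

$$c^2 = \frac{(n_1 + n_2 - 1)(p - 1)}{(n_1 + n_2 - p + 1)}\,F_{p-1,\,n_1+n_2-p+1}(\alpha)$$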


Example 6.14 (A profile analysis of love and marriage data) As part of a larger study 
of love and marriage, E. Hatfield, a sociologist, surveyed adults with respect to their 
marriage “contributions” and “outcomes” and their levels of “passionate” and 
“companionate” love. Recently married males and females were asked to respond 
to the following questions, using the 8-point scale in the figure below. 


[Figure: the 8-point response scale.]




1. All things considered, how would you describe your contributions to the 
marriage? 
2. All things considered, how would you describe your outcomes from the 
marriage? 
Subjects were also asked to respond to the following questions, using the 
5-point scale shown. 


3. What is the level of passionate love that you feel for your partner? 
4. What is the level of companionate love that you feel for your partner? 


None       Very                  A great     Tremendous
at all     little     Some       deal        amount
  1          2          3          4            5

Let

$x_1$ = an 8-point scale response to Question 1
$x_2$ = an 8-point scale response to Question 2
$x_3$ = a 5-point scale response to Question 3
$x_4$ = a 5-point scale response to Question 4


and the two populations be defined as

Population 1 = married men
Population 2 = married women

The population means are the average responses to the p = 4 questions for the
populations of males and females. Assuming a common covariance matrix $\boldsymbol{\Sigma}$, it is of
interest to see whether the profiles of males and females are the same.
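A sample of $n_1 = 30$ males and $n_2 = 30$ females gave the sample mean vectors

$$\bar{\mathbf{x}}_1 = \begin{bmatrix} 6.833 \\ 7.033 \\ 3.967 \\ 4.700 \end{bmatrix} \ \text{(males)}, \qquad \bar{\mathbf{x}}_2 = \begin{bmatrix} 6.633 \\ 7.000 \\ 4.000 \\ 4.533 \end{bmatrix} \ \text{(females)}$$

and pooled covariance matrix

$$\mathbf{S}_{\text{pooled}} = \begin{bmatrix} .606 & .262 & .066 & .161 \\ .262 & .637 & .173 & .143 \\ .066 & .173 & .810 & .029 \\ .161 & .143 & .029 & .306 \end{bmatrix}$$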


The sample mean vectors are plotted as sample profiles in Figure 6.5 on page 327. 

Since the sample sizes are reasonably large, we shall use the normal theory
methodology, even though the data, which are integers, are clearly nonnormal. To
test for parallelism ($H_{01}\colon \mathbf{C}\boldsymbol{\mu}_1 = \mathbf{C}\boldsymbol{\mu}_2$), we compute




[Figure 6.5 Sample profiles for marriage-love responses. Vertical axis: sample mean response $\bar{x}_{\ell i}$; horizontal axis: variable (1, 2, 3, 4). Key: x = males, o = females.]
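
$$\mathbf{C}\mathbf{S}_{\text{pooled}}\mathbf{C}' = \begin{bmatrix} -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 \end{bmatrix}\mathbf{S}_{\text{pooled}}\begin{bmatrix} -1 & 0 & 0 \\ 1 & -1 & 0 \\ 0 & 1 & -1 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} .719 & -.268 & -.125 \\ -.268 & 1.101 & -.751 \\ -.125 & -.751 & 1.058 \end{bmatrix}$$

and

$$\mathbf{C}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) = \begin{bmatrix} -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 \end{bmatrix}\begin{bmatrix} .200 \\ .033 \\ -.033 \\ .167 \end{bmatrix} = \begin{bmatrix} -.167 \\ -.066 \\ .200 \end{bmatrix}$$

Thus,

$$T^2 = [-.167,\ -.066,\ .200]\left(\tfrac{1}{30} + \tfrac{1}{30}\right)^{-1}\begin{bmatrix} .719 & -.268 & -.125 \\ -.268 & 1.101 & -.751 \\ -.125 & -.751 & 1.058 \end{bmatrix}^{-1}\begin{bmatrix} -.167 \\ -.066 \\ .200 \end{bmatrix} = 15(.067) = 1.005$$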


Moreover, with $\alpha = .05$, $c^2 = [(30 + 30 - 2)(4 - 1)/(30 + 30 - 4)]F_{3,56}(.05) = 3.11(2.8)
= 8.7$. Since $T^2 = 1.005 < 8.7$, we conclude that the hypothesis of parallel profiles
for men and women is tenable. Given the plot in Figure 6.5, this finding is not
surprising.

Assuming that the profiles are parallel, we can test for coincident profiles. To
test $H_{02}\colon \mathbf{1}'\boldsymbol{\mu}_1 = \mathbf{1}'\boldsymbol{\mu}_2$ (profiles coincident), we need

Sum of elements in $(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) = \mathbf{1}'(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) = .367$

Sum of elements in $\mathbf{S}_{\text{pooled}} = \mathbf{1}'\mathbf{S}_{\text{pooled}}\mathbf{1} = 4.027$

Using (6-74), we obtain

$$T^2 = \left(\frac{.367}{\sqrt{\left(\frac{1}{30} + \frac{1}{30}\right)4.027}}\right)^2 = .501$$

With $\alpha = .05$, $F_{1,58}(.05) = 4.0$, and $T^2 = .501 < F_{1,58}(.05) = 4.0$, we cannot reject
the hypothesis that the profiles are coincident. That is, the responses of men and
women to the four questions posed appear to be the same.

We could now test for level profiles; however, it does not make sense to carry
out this test for our example, since Questions 1 and 2 were measured on a scale of
1-8, while Questions 3 and 4 were measured on a scale of 1-5. The incompatibility of
these scales makes the test for level profiles meaningless and illustrates the need for
similar measurements in order to carry out a complete profile analysis. ■
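
The two profile tests carried out for the love and marriage data can be retraced numerically. The following plain-Python sketch assumes the sample mean vectors and pooled covariance matrix as printed in the example; `solve` is an illustrative Gaussian-elimination helper written for this sketch, not a library call.

```python
def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    m = [row[:] + [v] for row, v in zip(a, rhs)]
    size = len(m)
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (m[r][size] - tail) / m[r][r]
    return x


xbar1 = [6.833, 7.033, 3.967, 4.700]   # males
xbar2 = [6.633, 7.000, 4.000, 4.533]   # females
s_pooled = [[0.606, 0.262, 0.066, 0.161],
            [0.262, 0.637, 0.173, 0.143],
            [0.066, 0.173, 0.810, 0.029],
            [0.161, 0.143, 0.029, 0.306]]
n1 = n2 = 30
p = 4

# Contrast matrix C of (6-72): successive differences of the p components.
C = [[-1 if j == i else 1 if j == i + 1 else 0 for j in range(p)]
     for i in range(p - 1)]

d = [a - b for a, b in zip(xbar1, xbar2)]
cd = [sum(C[i][j] * d[j] for j in range(p)) for i in range(p - 1)]
cs = [[sum(C[i][k] * s_pooled[k][j] for k in range(p)) for j in range(p)]
      for i in range(p - 1)]
csc = [[sum(cs[i][k] * C[j][k] for k in range(p)) for j in range(p - 1)]
       for i in range(p - 1)]

scale = 1 / n1 + 1 / n2

# Parallel profiles, (6-73): T^2 = (Cd)' [scale * C S C']^{-1} (Cd).
z = solve(csc, cd)
t2_parallel = sum(u * v for u, v in zip(cd, z)) / scale

# Coincident profiles, (6-74): uses the sums 1'd and 1'S1.
sum_d = sum(d)
sum_s = sum(sum(row) for row in s_pooled)
t2_coincident = sum_d ** 2 / (scale * sum_s)

print(round(t2_parallel, 3), round(sum_d, 3), round(sum_s, 3),
      round(t2_coincident, 3))
```

The sixteen printed entries of the pooled covariance matrix sum to 4.027, and that value reproduces the reported statistic of about .501 for the coincidence test.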


When the sample sizes are small, a profile analysis will depend on the normality
assumption. This assumption can be checked, using methods discussed in Chapter 4,
with the original observations $\mathbf{x}_{\ell j}$ or the contrast observations $\mathbf{C}\mathbf{x}_{\ell j}$.

The analysis of profiles for several populations proceeds in much the same 
fashion as that for two populations. In fact, the general measures of comparison are 
analogous to those just discussed. (See [13], [18].) 


6.9 Repeated Measures Designs and Growth Curves 


As we said earlier, the term “repeated measures” refers to situations where the same 
characteristic is observed, at different times or locations, on the same subject. 


(a) The observations on a subject may correspond to different treatments as in 
Example 6.2 where the time between heartbeats was measured under the 2 x 2 
treatment combinations applied to each dog. The treatments need to be com- 
pared when the responses on the same subject are correlated. 


(b) A single treatment may be applied to each subject and a single characteristic
observed over a period of time. For instance, we could measure the weight of a
puppy at birth and then once a month. It is the curve traced by a typical dog that
must be modeled. In this context, we refer to the curve as a growth curve.

When some subjects receive one treatment and others another treatment, 
the growth curves for the treatments need to be compared. 


To illustrate the growth curve model introduced by Potthoff and Roy [21], we
consider calcium measurements of the dominant ulna bone in older women. Besides
an initial reading, Table 6.5 gives readings after one year, two years, and three years
for the control group. Readings obtained by photon absorptiometry from the same
subject are correlated but those from different subjects should be independent. The
model assumes that the same covariance matrix $\boldsymbol{\Sigma}$ holds for each subject. Unlike
univariate approaches, this model does not require the four measurements to have
equal variances. A profile, constructed from the four sample means $(\bar{x}_1, \bar{x}_2, \bar{x}_3, \bar{x}_4)$,
summarizes the growth, which here is a loss of calcium over time. Can the growth
pattern be adequately represented by a polynomial in time?


Repeated Measures Designs and Growth Curves 329 


Table 6.5 Calcium Measurements on the Dominant Ulna; Control Group 


Mean 72.38 73.29 72.47 64.79 
Source: Data courtesy of Everett Smith. 
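When the p measurements on all subjects are taken at times $t_1, t_2, \ldots, t_p$, the
Potthoff-Roy model for quadratic growth becomes

$$E[\mathbf{X}] = E\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 t_1 + \beta_2 t_1^2 \\ \beta_0 + \beta_1 t_2 + \beta_2 t_2^2 \\ \vdots \\ \beta_0 + \beta_1 t_p + \beta_2 t_p^2 \end{bmatrix}$$

where the ith mean $\mu_i$ is the quadratic expression evaluated at $t_i$.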

Usually groups need to be compared. Table 6.6 gives the calcium measurements 
for a second set of women, the treatment group, that received special help with diet 
and a regular exercise program. 

When a study involves several treatment groups, an extra subscript is needed as
in the one-way MANOVA model. Let $\mathbf{X}_{\ell 1}, \mathbf{X}_{\ell 2}, \ldots, \mathbf{X}_{\ell n_\ell}$ be the $n_\ell$ vectors of
measurements on the $n_\ell$ subjects in group $\ell$, for $\ell = 1, \ldots, g$.


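Assumptions. All of the $\mathbf{X}_{\ell j}$ are independent and have the same covariance
matrix $\boldsymbol{\Sigma}$. Under the quadratic growth model, the mean vectors are

$$E[\mathbf{X}_{\ell j}] = \begin{bmatrix} \beta_{\ell 0} + \beta_{\ell 1} t_1 + \beta_{\ell 2} t_1^2 \\ \beta_{\ell 0} + \beta_{\ell 1} t_2 + \beta_{\ell 2} t_2^2 \\ \vdots \\ \beta_{\ell 0} + \beta_{\ell 1} t_p + \beta_{\ell 2} t_p^2 \end{bmatrix} = \begin{bmatrix} 1 & t_1 & t_1^2 \\ 1 & t_2 & t_2^2 \\ \vdots & \vdots & \vdots \\ 1 & t_p & t_p^2 \end{bmatrix}\begin{bmatrix} \beta_{\ell 0} \\ \beta_{\ell 1} \\ \beta_{\ell 2} \end{bmatrix} = \mathbf{B}\boldsymbol{\beta}_\ell$$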




Table 6.6 Calcium Measurements on the Dominant Ulna; Treatment Group

Subject    Initial    1 year    2 year    3 year
1          83.8       85.5      86.2      81.2
2          65.3       66.9      67.0      60.6
3          81.2       79.5      84.5      75.2
4          75.4       76.7      74.3      66.7
5          55.3       58.3      59.1      54.2
6          70.3       72.3      70.6      68.6
7          76.5       79.9      80.4      71.6
8          66.0       70.9      70.3      64.1
9          76.7       79.0      76.9      70.3
10         77.2       74.0      77.8      67.9
11         67.3       70.7      68.9      65.9
12         50.3       51.4      53.6      48.0
13         57.7       57.0      57.5      51.5
14         74.3       77.7      72.6      68.0
15         74.0       74.7      74.5      65.7
16         57.3       56.0      64.7      53.0
Mean       69.29      70.66     71.18     64.53

Source: Data courtesy of Everett Smith.
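where

$$\mathbf{B} = \begin{bmatrix} 1 & t_1 & t_1^2 \\ 1 & t_2 & t_2^2 \\ \vdots & \vdots & \vdots \\ 1 & t_p & t_p^2 \end{bmatrix} \quad \text{and} \quad \boldsymbol{\beta}_\ell = \begin{bmatrix} \beta_{\ell 0} \\ \beta_{\ell 1} \\ \beta_{\ell 2} \end{bmatrix} \tag{6-76}$$

If a qth-order polynomial is fit to the growth data, then

$$\mathbf{B} = \begin{bmatrix} 1 & t_1 & \cdots & t_1^q \\ 1 & t_2 & \cdots & t_2^q \\ \vdots & \vdots & & \vdots \\ 1 & t_p & \cdots & t_p^q \end{bmatrix} \quad \text{and} \quad \boldsymbol{\beta}_\ell = \begin{bmatrix} \beta_{\ell 0} \\ \beta_{\ell 1} \\ \vdots \\ \beta_{\ell q} \end{bmatrix} \tag{6-77}$$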


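Under the assumption of multivariate normality, the maximum likelihood
estimators of the $\boldsymbol{\beta}_\ell$ are

$$\hat{\boldsymbol{\beta}}_\ell = (\mathbf{B}'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{B})^{-1}\mathbf{B}'\mathbf{S}_{\text{pooled}}^{-1}\bar{\mathbf{X}}_\ell \quad \text{for } \ell = 1, 2, \ldots, g \tag{6-78}$$

where

$$\mathbf{S}_{\text{pooled}} = \frac{1}{(N - g)}\big((n_1 - 1)\mathbf{S}_1 + \cdots + (n_g - 1)\mathbf{S}_g\big) = \frac{1}{N - g}\mathbf{W}$$

with $N = \sum_{\ell=1}^{g} n_\ell$, is the pooled estimator of the common covariance matrix $\boldsymbol{\Sigma}$. The
estimated covariances of the maximum likelihood estimators are

$$\widehat{\mathrm{Cov}}(\hat{\boldsymbol{\beta}}_\ell) = \frac{k}{n_\ell}(\mathbf{B}'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{B})^{-1} \quad \text{for } \ell = 1, 2, \ldots, g \tag{6-79}$$

where $k = (N - g)(N - g - 1)/\big((N - g - p + q)(N - g - p + q + 1)\big)$.

Also, $\hat{\boldsymbol{\beta}}_\ell$ and $\hat{\boldsymbol{\beta}}_h$ are independent, for $\ell \ne h$, so their covariance is $\mathbf{0}$.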

We can formally test that a qth-order polynomial is adequate. When the model is fit
without restrictions, the error sum of squares and cross products matrix is just the
within groups $\mathbf{W}$ that has $N - g$ degrees of freedom. Under a qth-order polynomial,
the error sum of squares and cross products

$$\mathbf{W}_q = \sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell} (\mathbf{X}_{\ell j} - \mathbf{B}\hat{\boldsymbol{\beta}}_\ell)(\mathbf{X}_{\ell j} - \mathbf{B}\hat{\boldsymbol{\beta}}_\ell)' \tag{6-80}$$

has $N - g + p - q - 1$ degrees of freedom. The likelihood ratio test of the null
hypothesis that the qth-order polynomial is adequate can be based on Wilks’ lambda

$$\Lambda^* = \frac{|\mathbf{W}|}{|\mathbf{W}_q|} \tag{6-81}$$

Under the polynomial growth model, there are $q + 1$ terms instead of the p means
for each of the groups. Thus there are $(p - q - 1)g$ fewer parameters. For large
sample sizes, the null hypothesis that the polynomial is adequate is rejected if

$$-\left(N - \frac{1}{2}(p - q + g)\right)\ln \Lambda^* > \chi^2_{(p-q-1)g}(\alpha) \tag{6-82}$$


Example 6.15 (Fitting a quadratic growth curve to calcium loss) Refer to the data in
Tables 6.5 and 6.6. Fit the model for quadratic growth.

A computer calculation gives

$$[\hat{\boldsymbol{\beta}}_1, \hat{\boldsymbol{\beta}}_2] = \begin{bmatrix} 73.0701 & 70.1387 \\ 3.6444 & 4.0900 \\ -2.0274 & -1.8534 \end{bmatrix}$$

so the estimated growth curves are

Control group:      $73.07 + 3.64t - 2.03t^2$
                    (2.58)   (.83)   (.28)

Treatment group:    $70.14 + 4.09t - 1.85t^2$
                    (2.50)   (.80)   (.27)

where

$$(\mathbf{B}'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{B})^{-1} = \begin{bmatrix} 93.1744 & -5.8368 & 0.2184 \\ -5.8368 & 9.5699 & -3.0240 \\ 0.2184 & -3.0240 & 1.1051 \end{bmatrix}$$

and, by (6-79), the standard errors given below the parameter estimates were
obtained by multiplying the diagonal elements by $k/n_\ell$ and taking the square root.
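
The standard errors reported under the fitted curves follow from (6-79). A minimal Python sketch is given below; it assumes $n_1 = 15$ control subjects, a value inferred from $N = 31$ in the adequacy test of this example together with the 16 treatment subjects of Table 6.6, and the function name `std_errors` is illustrative.

```python
import math

# Design constants for Example 6.15: two groups, four time points, quadratic fit.
N, g, p, q = 31, 2, 4, 2
n_control, n_treatment = 15, 16   # n_control is inferred as N - 16 (assumption)

# Multiplier k from (6-79).
k = (N - g) * (N - g - 1) / ((N - g - p + q) * (N - g - p + q + 1))

# Diagonal of (B' S_pooled^{-1} B)^{-1} as printed in the example.
diag = [93.1744, 9.5699, 1.1051]


def std_errors(n_l):
    """Square roots of the diagonal of (k / n_l) (B' S^{-1} B)^{-1}."""
    return [math.sqrt(k * d / n_l) for d in diag]


se_control = std_errors(n_control)
se_treatment = std_errors(n_treatment)

print(round(k, 4))
print([round(s, 2) for s in se_control])
print([round(s, 2) for s in se_treatment])
```

With these inputs the multiplier is k = 29/27, and including it is what brings the computed values into agreement with the standard errors printed under the estimated growth curves.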




Examination of the estimates and the standard errors reveals that the $t^2$ terms
are needed. Loss of calcium is predicted after 3 years for both groups. Further, there
does not seem to be any substantial difference between the two groups.

Wilks’ lambda for testing the null hypothesis that the quadratic growth model is
adequate becomes

$$\Lambda^* = \frac{|\mathbf{W}|}{|\mathbf{W}_2|} = \frac{\begin{vmatrix} 2726.282 & 2660.749 & 2369.308 & 2335.912 \\ 2660.749 & 2756.009 & 2343.514 & 2327.961 \\ 2369.308 & 2343.514 & 2301.714 & 2098.544 \\ 2335.912 & 2327.961 & 2098.544 & 2277.452 \end{vmatrix}}{\begin{vmatrix} 2781.017 & 2698.589 & 2363.228 & 2362.253 \\ 2698.589 & 2832.430 & 2331.235 & 2381.160 \\ 2363.228 & 2331.235 & 2303.687 & 2089.996 \\ 2362.253 & 2381.160 & 2089.996 & 2314.485 \end{vmatrix}} = .7627$$

Since, with $\alpha = .01$,

$$-\left(N - \frac{1}{2}(p - q + g)\right)\ln \Lambda^* = -\left(31 - \frac{1}{2}(4 - 2 + 2)\right)\ln .7627 = 7.86 < \chi^2_{(4-2-1)2}(.01) = 9.21$$

we fail to reject the adequacy of the quadratic fit at $\alpha = .01$. Since the p-value is less
than .05 there is, however, some evidence that the quadratic does not fit well.

We could, without restricting to quadratic growth, test for parallel and coincident
calcium loss using profile analysis. ■
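
The adequacy test of Example 6.15 reduces to a short calculation once Wilks' lambda is known. The sketch below, in plain Python, assumes the value .7627 quoted in the example; the closed-form tail probability it uses is a property of the chi-square distribution with exactly 2 degrees of freedom and would not apply for other degrees of freedom.

```python
import math

# Quantities from Example 6.15.
N, p, q, g = 31, 4, 2, 2
wilks = 0.7627

# Large-sample statistic of (6-82) and its degrees of freedom.
stat = -(N - 0.5 * (p - q + g)) * math.log(wilks)
df = (p - q - 1) * g

# For a chi-square variable with 2 d.f., P(X > x) = exp(-x / 2).
assert df == 2
p_value = math.exp(-stat / 2)

print(round(stat, 2), df, round(p_value, 4))
```

The tail probability falls between .01 and .05, which matches the conclusion that the quadratic fit is not rejected at the 1% level although there is some evidence against it.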


The Potthoff and Roy growth curve model holds for more general designs than
one-way MANOVA. However, the $\hat{\boldsymbol{\beta}}_\ell$ are no longer given by (6-78) and the expression
for its covariance matrix becomes more complicated than (6-79). We refer the
reader to [14] for more examples and further tests.

There are many other modifications to the model treated here. They include the
following:


(a) Dropping the restriction to polynomial growth. Use nonlinear parametric 
models or even nonparametric splines. 


(b) Restricting the covariance matrix to a special form such as equally correlated 
responses on the same individual. 


(c) Observing more than one response variable, over time, on the same individual.
This results in a multivariate version of the growth curve model.


6.10 Perspectives and a Strategy for Analyzing 
Multivariate Models 


We emphasize that, with several characteristics, it is important to control the overall 
probability of making any incorrect decision. This is particularly important when 
testing for the equality of two or more treatments as the examples in this chapter 


Perspectives and a Strategy for Analyzing Multivariate Models 333 


indicate. A single multivariate test, with its associated single p-value, is preferable to 
performing a large number of univariate tests. The outcome tells us whether or not 
it is worthwhile to look closer on a variable by variable and group by group analysis. 

A single multivariate test is recommended over, say, p univariate tests because, 
as the next example demonstrates, univariate tests ignore important information 
and can give misleading results. 


Example 6.16 (Comparing multivariate and univariate tests for the differences in
means) Suppose we collect measurements on two variables $X_1$ and $X_2$ for ten
randomly selected experimental units from each of two groups. The hypothetical
data are noted here and displayed as scatter plots and marginal dot diagrams in
Figure 6.6 on page 334.


x1     x2     Group
5.0    3.0    1
4.5    3.2    1
6.0    3.5    1
6.0    4.6    1
6.2    5.6    1
6.9    5.2    1
6.8    6.0    1
5.3    5.5    1
6.6    7.3    1
...    ...    1
4.6    4.9    2
4.9    5.9    2
4.0    4.1    2
3.8    5.4    2
6.2    6.1    2
5.0    7.0    2
5.3    4.7    2
7.1    6.6    2
5.8    7.8    2
6.8    8.0    2

It is clear from the horizontal marginal dot diagram that there is considerable
overlap in the $x_1$ values for the two groups. Similarly, the vertical marginal dot diagram
shows there is considerable overlap in the $x_2$ values for the two groups. The
scatter plots suggest that there is fairly strong positive correlation between the two
variables for each group, and that, although there is some overlap, the group 1
measurements are generally to the southeast of the group 2 measurements.

Let $\boldsymbol{\mu}_1' = [\mu_{11}, \mu_{12}]$ be the population mean vector for the first group, and let
$\boldsymbol{\mu}_2' = [\mu_{21}, \mu_{22}]$ be the population mean vector for the second group. Using the $x_1$
observations, a univariate analysis of variance gives $F = 2.46$ with $\nu_1 = 1$ and
$\nu_2 = 18$ degrees of freedom. Consequently, we cannot reject $H_0\colon \mu_{11} = \mu_{21}$ at any
reasonable significance level ($F_{1,18}(.10) = 3.01$). Using the $x_2$ observations, a univariate
analysis of variance gives $F = 2.68$ with $\nu_1 = 1$ and $\nu_2 = 18$ degrees of freedom.
Again, we cannot reject $H_0\colon \mu_{12} = \mu_{22}$ at any reasonable significance level.




[Figure 6.6 Scatter plots and marginal dot diagrams for the data from two groups.]


The univariate tests suggest there is no difference between the component means
for the two groups, and hence we cannot discredit $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$.

On the other hand, if we use Hotelling’s $T^2$ to test for the equality of the mean
vectors, we find

$$T^2 = 17.29 > c^2 = \frac{(18)(2)}{17}F_{2,17}(.01) = 2.118 \times 6.11 = 12.94$$

and we reject $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ at the 1% level. The multivariate test takes into account
the positive correlation between the two measurements for each group, information
that is unfortunately ignored by the univariate tests. This $T^2$-test is equivalent to
the MANOVA test (6-42). ■


Example 6.17 (Data on lizards that require a bivariate test to establish a difference in
means) A zoologist collected lizards in the southwestern United States. Among
other variables, he measured mass (in grams) and the snout-vent length (in millimeters).
Because the tails sometimes break off in the wild, the snout-vent length is a
more representative measure of length. The data for the lizards from two genera,
Cnemidophorus (C) and Sceloporus (S), collected in 1997 and 1999 are given in
Table 6.7. Notice that there are $n_1 = 20$ measurements for C lizards and $n_2 = 40$
measurements for S lizards.

After taking natural logarithms, the summary statistics are

$$\text{C:}\quad n_1 = 20 \qquad \bar{\mathbf{x}}_1 = \begin{bmatrix} 2.240 \\ 4.394 \end{bmatrix} \qquad \mathbf{S}_1 = \begin{bmatrix} 0.35305 & 0.09417 \\ 0.09417 & 0.02595 \end{bmatrix}$$

$$\text{S:}\quad n_2 = 40 \qquad \bar{\mathbf{x}}_2 = \begin{bmatrix} 2.368 \\ 4.308 \end{bmatrix} \qquad \mathbf{S}_2 = \begin{bmatrix} 0.50684 & 0.14539 \\ 0.14539 & 0.04255 \end{bmatrix}$$



Table 6.7 Lizard Data for Two Genera 


SVL = snout-vent length. 
Source: Data courtesy of Kevin E. Bonine. 


Figure 6.7 Scatter plot of ln(Mass) versus ln(SVL) for the lizard data in Table 6.7.


A plot of mass (Mass) versus snout-vent length (SVL), after taking natural logarithms,
is shown in Figure 6.7. The large-sample individual 95% confidence intervals for the
difference in ln(Mass) means and the difference in ln(SVL) means both cover 0.

ln(Mass): $\mu_{11} - \mu_{21}$: $(-0.476,\ 0.220)$
ln(SVL): $\mu_{12} - \mu_{22}$: $(-0.011,\ 0.183)$




The corresponding univariate Student's t-test statistics for testing for no difference
in the individual means have p-values of .46 and .08, respectively. Clearly, from a
univariate perspective, we cannot detect a difference in mass means or a difference
in snout-vent length means for the two genera of lizards.

However, consistent with the scatter diagram in Figure 6.7, a bivariate analysis
strongly supports a difference in size between the two groups of lizards. Using Result
6.4 (also see Example 6.5), the $T^2$-statistic has an approximate $\chi_2^2$ distribution.
For this example, $T^2 = 225.4$ with a p-value less than .0001. A multivariate method is
essential in this case.
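The large-sample calculation in Example 6.17 can be sketched directly from the summary statistics. The code below is a minimal illustration, not the authors' computation: it assumes the statistic has the unequal-covariance form $(\bar{x}_1 - \bar{x}_2)'[S_1/n_1 + S_2/n_2]^{-1}(\bar{x}_1 - \bar{x}_2)$ of Result 6.4, uses the rounded summary statistics as read from the example, and relies on the fact that the upper tail of a chi-square distribution with 2 degrees of freedom is $e^{-x/2}$. The helper name `inv2` is hypothetical.

```python
import math

def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Summary statistics on the log scale: [ln(Mass), ln(SVL)].
n1, xbar1 = 20, [2.240, 4.394]
S1 = [[0.35305, 0.09417], [0.09417, 0.02595]]
n2, xbar2 = 40, [2.368, 4.308]
S2 = [[0.50684, 0.14539], [0.14539, 0.04255]]

# Estimated covariance matrix of the difference in sample mean vectors.
V = [[S1[i][j] / n1 + S2[i][j] / n2 for j in range(2)] for i in range(2)]
V_inv = inv2(V)

d = [xbar1[i] - xbar2[i] for i in range(2)]
t2 = sum(d[i] * V_inv[i][j] * d[j] for i in range(2) for j in range(2))

# Upper-tail probability of a chi-square variable with 2 degrees of freedom.
p_value = math.exp(-t2 / 2)
print(f"T^2 = {t2:.1f}, approximate p-value = {p_value:.3g}")
```

With the rounded inputs the statistic comes out close to, but not exactly at, the reported 225.4; the small gap is what one would expect from rounding in the means and covariances.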


Examples 6.16 and 6.17 demonstrate the efficacy of a multivariate test relative
to its univariate counterparts. We encountered exactly this situation with the effluent
data in Example 6.1.

In the context of random samples from several populations (recall the one-way 
MANOVA in Section 6.4), multivariate tests are based on the matrices 


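\[ W = \sum_{\ell=1}^{g} \sum_{j=1}^{n_\ell} (x_{\ell j} - \bar{x}_\ell)(x_{\ell j} - \bar{x}_\ell)' \quad \text{and} \quad B = \sum_{\ell=1}^{g} n_\ell (\bar{x}_\ell - \bar{x})(\bar{x}_\ell - \bar{x})' \]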


Throughout this chapter, we have used 


\[ \text{Wilks' lambda statistic}\quad \Lambda^* = \frac{|W|}{|B + W|} \]


which is equivalent to the likelihood ratio test. Three other multivariate test statis- 
tics are regularly included in the output of statistical packages. 


Lawley-Hotelling trace = $\mathrm{tr}[BW^{-1}]$

Pillai trace = $\mathrm{tr}[B(B + W)^{-1}]$

Roy's largest root = maximum eigenvalue of $W(B + W)^{-1}$
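To make the four definitions concrete, the sketch below evaluates each statistic for a pair of 2 x 2 matrices. It is a minimal illustration under stated assumptions: the matrices `W` and `B` are hypothetical values chosen for the example, the helper names (`det2`, `inv2`, `mul2`, `trace2`, `eigenvalues2`) are not from any library, and the closed-form eigenvalue formula applies only to the 2 x 2 case. Roy's statistic is computed exactly as defined in the text above.

```python
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def mul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace2(m):
    return m[0][0] + m[1][1]

def eigenvalues2(m):
    """Eigenvalues of a 2 x 2 matrix with real eigenvalues, largest first."""
    half_tr = trace2(m) / 2
    root = math.sqrt(max(half_tr ** 2 - det2(m), 0.0))
    return half_tr + root, half_tr - root

# Hypothetical within (W) and between (B) sum of squares and cross products matrices.
W = [[10.0, 1.0], [1.0, 24.0]]
B = [[78.0, -12.0], [-12.0, 48.0]]
T = [[B[i][j] + W[i][j] for j in range(2)] for i in range(2)]  # B + W

wilks = det2(W) / det2(T)
lawley_hotelling = trace2(mul2(B, inv2(W)))
pillai = trace2(mul2(B, inv2(T)))
roy = max(eigenvalues2(mul2(W, inv2(T))))

print(wilks, lawley_hotelling, pillai, roy)
```

Two identities give a quick sanity check in this setting: the product of the eigenvalues of $W(B + W)^{-1}$ equals Wilks' lambda, and for $p = 2$ the Pillai trace equals 2 minus the trace of $W(B + W)^{-1}$.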


All four of these tests appear to be nearly equivalent for extremely large sam- 
ples. For moderate sample sizes, all comparisons are based on what is necessarily a 
limited number of cases studied by simulation. From the simulations reported to 
date, the first three tests have similar power, while the last, Roy’s test, behaves dif- 
ferently. Its power is best only when there is a single nonzero eigenvalue and, at the 
same time, the power is large. This may approximate situations where a large 
difference exists in just one characteristic and it is between one group and all of the
others. There is also some suggestion that Pillai's trace is slightly more robust
against nonnormality. However, we suggest trying transformations on the original 
data when the residuals are nonnormal. 

All four statistics apply in the two-way setting and in even more complicated 
MANOVA. More discussion is given in terms of the multivariate regression model 
in Chapter 7. 

When, and only when, the multivariate test signals a difference, or departure
from the null hypothesis, do we probe deeper. We recommend calculating the
Bonferroni intervals for all pairs of groups and all characteristics. The simultaneous
confidence statements determined from the shadows of the confidence ellipse are,
typically, too large. The one-at-a-time intervals may be suggestive of differences that
merit further study but, with the current data, cannot be taken as conclusive evidence
for the existence of differences. We summarize the procedure developed in
this chapter for comparing treatments. The first step is to check the data for outliers
using visual displays and other calculations.


A Strategy for the Multivariate Comparison of Treatments 


1. Try to identify outliers. Check the data group by group for outliers. Also 
check the collection of residual vectors from any fitted model for outliers. 
Be aware of any outliers so calculations can be performed with and without 
them. 

2. Perform a multivariate test of hypothesis. Our choice is the likelihood ratio 
test, which is equivalent to Wilks’ lambda test. 


3. Calculate the Bonferroni simultaneous confidence intervals. If the multivariate
test reveals a difference, then proceed to calculate the Bonferroni
confidence intervals for all pairs of groups or treatments, and all characteristics.
If no differences are significant, try looking at Bonferroni intervals for
the larger set of responses that includes the differences and sums of pairs of
responses.


We must issue one caution concerning the proposed strategy. It may be the case 
that differences would appear in only one of the many characteristics and, further, 
the differences hold for only a few treatment combinations. Then, these few active 
differences may become lost among all the inactive ones. That is, the overall test may 
not show significance whereas a univariate test restricted to the specific active vari- 
able would detect the difference. The best preventative is a good experimental 
design. To design an effective experiment when one specific variable is expected to 
produce differences, do not include too many other variables that are not expected 
to show differences among the treatments. 


Exercises

6.1. Construct and sketch a joint 95% confidence region for the mean difference vector $\delta$
using the effluent data and results in Example 6.1. Note that the point $\delta = 0$ falls
outside the 95% contour. Is this result consistent with the test of $H_0\colon \delta = 0$ considered
in Example 6.1? Explain.

6.2. Using the information in Example 6.1, construct the 95% Bonferroni simultaneous
intervals for the components of the mean difference vector $\delta$. Compare the lengths of
these intervals with those of the simultaneous intervals constructed in the example.

6.3. The data corresponding to sample 8 in Table 6.1 seem unusually large. Remove sample 8.
Construct a joint 95% confidence region for the mean difference vector $\delta$ and the 95%
Bonferroni simultaneous intervals for the components of the mean difference vector.
Are the results consistent with a test of $H_0\colon \delta = 0$? Discuss. Does the "outlier" make a
difference in the analysis of these data?




6.4. Refer to Example 6.1.
(a) Redo the analysis in Example 6.1 after transforming the pairs of observations to
ln(BOD) and ln(SS).
(b) Construct the 95% Bonferroni simultaneous intervals for the components of the
mean vector $\delta$ of transformed variables.
(c) Discuss any possible violation of the assumption of a bivariate normal distribution
for the difference vectors of transformed observations.

6.5. A researcher considered three indices measuring the severity of heart attacks. The
values of these indices for $n = 40$ heart-attack patients arriving at a hospital emergency
room produced the summary statistics

\[ \bar{x} = \begin{bmatrix} 46.1 \\ 57.3 \\ 50.4 \end{bmatrix} \quad \text{and} \quad S = \begin{bmatrix} 101.3 & 63.0 & 71.0 \\ 63.0 & 80.2 & 55.6 \\ 71.0 & 55.6 & 97.4 \end{bmatrix} \]

(a) All three indices are evaluated for each patient. Test for the equality of mean indices
using (6-16) with $\alpha = .05$.
(b) Judge the differences in pairs of mean indices using 95% simultaneous confidence
intervals. [See (6-18).]

6.6. Use the data for treatments 2 and 3 in Exercise 6.8.
(a) Calculate $S_{\text{pooled}}$.
(b) Test $H_0\colon \mu_2 - \mu_3 = 0$ employing a two-sample approach with $\alpha = .01$.
(c) Construct 99% simultaneous confidence intervals for the differences $\mu_{2i} - \mu_{3i}$,
$i = 1, 2$.

6.7. Using the summary statistics for the electricity-demand data given in Example 6.4,
compute $T^2$ and test the hypothesis $H_0\colon \mu_1 - \mu_2 = 0$, assuming that $\Sigma_1 = \Sigma_2$. Set $\alpha = .05$.
Also, determine the linear combination of mean components most responsible for the
rejection of $H_0$.

6.8. Observations on two responses are collected for three treatments. The observation
vectors $\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ are


Treatment 1: | 


Treatment 2: ? 


Treat t3: 2 3 
reatment3: |, |, |, | 


(a) Break up the observations into mean, treatment, and residual components, as in 
(6-39). Construct the corresponding arrays for each variable. (See Example 6.9.) 

(b) Using the information in Part a, construct the one-way MANOVA table. 

(c) Evaluate Wilks' lambda, $\Lambda^*$, and use Table 6.3 to test for treatment effects. Set
$\alpha = .01$. Repeat the test using the chi-square approximation with Bartlett's correction.
[See (6-43).] Compare the conclusions.


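6.9. Using the contrast matrix $C$ in (6-13), verify the relationships $d_j = Cx_j$, $\bar{d} = C\bar{x}$, and
$S_d = CSC'$ in (6-14).

6.10. Consider the univariate one-way decomposition of the observation $x_{\ell j}$ given by (6-34).
Show that the mean vector $\bar{x}\mathbf{1}$ is always perpendicular to the treatment effect vector
$(\bar{x}_1 - \bar{x})\mathbf{u}_1 + (\bar{x}_2 - \bar{x})\mathbf{u}_2 + \cdots + (\bar{x}_g - \bar{x})\mathbf{u}_g$ where

\[ \mathbf{u}_1 = [\underbrace{1, \ldots, 1}_{n_1}, \underbrace{0, \ldots, 0}_{n_2}, \ldots, \underbrace{0, \ldots, 0}_{n_g}]', \quad \mathbf{u}_2 = [\underbrace{0, \ldots, 0}_{n_1}, \underbrace{1, \ldots, 1}_{n_2}, \ldots, \underbrace{0, \ldots, 0}_{n_g}]', \quad \ldots, \quad \mathbf{u}_g = [\underbrace{0, \ldots, 0}_{n_1}, \underbrace{0, \ldots, 0}_{n_2}, \ldots, \underbrace{1, \ldots, 1}_{n_g}]' \]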

6.11. A likelihood argument provides additional support for pooling the two independent
sample covariance matrices to estimate a common covariance matrix in the case of two
normal populations. Give the likelihood function, $L(\mu_1, \mu_2, \Sigma)$, for two independent
samples of sizes $n_1$ and $n_2$ from $N_p(\mu_1, \Sigma)$ and $N_p(\mu_2, \Sigma)$ populations, respectively. Show
that this likelihood is maximized by the choices $\hat{\mu}_1 = \bar{x}_1$, $\hat{\mu}_2 = \bar{x}_2$ and

\[ \hat{\Sigma} = \frac{1}{n_1 + n_2}\left[(n_1 - 1)S_1 + (n_2 - 1)S_2\right] = \left(\frac{n_1 + n_2 - 2}{n_1 + n_2}\right) S_{\text{pooled}} \]

Hint: Use (4-16) and the maximization Result 4.10.


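6.12. (Test for linear profiles, given that the profiles are parallel.) Let
$\mu_1' = [\mu_{11}, \mu_{12}, \ldots, \mu_{1p}]$ and $\mu_2' = [\mu_{21}, \mu_{22}, \ldots, \mu_{2p}]$ be the mean responses to $p$
treatments for populations 1 and 2, respectively. Assume that the profiles given by the two
mean vectors are parallel.

(a) Show that the hypothesis that the profiles are linear can be written as
$H_0\colon (\mu_{1i} + \mu_{2i}) - (\mu_{1,i-1} + \mu_{2,i-1}) = (\mu_{1,i-1} + \mu_{2,i-1}) - (\mu_{1,i-2} + \mu_{2,i-2})$, $i = 3, \ldots, p$, or as
$H_0\colon C(\mu_1 + \mu_2) = 0$, where the $(p - 2) \times p$ matrix

\[ C = \begin{bmatrix} 1 & -2 & 1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 1 & -2 & 1 & \cdots & 0 & 0 & 0 \\ \vdots & & & & & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & -2 & 1 \end{bmatrix} \]

(b) Following an argument similar to the one leading to (6-73), we reject
$H_0\colon C(\mu_1 + \mu_2) = 0$ at level $\alpha$ if

\[ T^2 = (\bar{x}_1 + \bar{x}_2)'C'\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right) C S_{\text{pooled}} C'\right]^{-1} C(\bar{x}_1 + \bar{x}_2) > c^2 \]

where

\[ c^2 = \frac{(n_1 + n_2 - 2)(p - 2)}{n_1 + n_2 - p + 1}\, F_{p-2,\, n_1+n_2-p+1}(\alpha) \]

Let $n_1 = 30$, $n_2 = 30$, $\bar{x}_1' = [6.4, 6.8, 7.3, 7.0]$, $\bar{x}_2' = [4.3, 4.9, 5.3, 5.1]$, and

\[ S_{\text{pooled}} = \begin{bmatrix} .61 & .26 & .07 & .16 \\ .26 & .64 & .17 & .14 \\ .07 & .17 & .81 & .03 \\ .16 & .14 & .03 & .31 \end{bmatrix} \]

Test for linear profiles, assuming that the profiles are parallel. Use $\alpha = .05$.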
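The contrast matrix in Exercise 6.12 takes second differences of adjacent means, so a linear profile maps to a zero vector. The following sketch builds that matrix and applies it to the sum of the two sample mean vectors given in the exercise. It is illustrative only: the function name `linear_profile_contrast` is hypothetical, and the code stops short of the test itself, which would still require the $T^2$ quadratic form and an F percentage point.

```python
def linear_profile_contrast(p):
    """Return the (p - 2) x p second-difference matrix with rows [.., 1, -2, 1, ..]."""
    rows = []
    for i in range(p - 2):
        row = [0] * p
        row[i], row[i + 1], row[i + 2] = 1, -2, 1
        rows.append(row)
    return rows

# Sample mean vectors from the exercise.
xbar1 = [6.4, 6.8, 7.3, 7.0]
xbar2 = [4.3, 4.9, 5.3, 5.1]

C = linear_profile_contrast(len(xbar1))
total = [a + b for a, b in zip(xbar1, xbar2)]

# C(xbar1 + xbar2): entries near zero would be consistent with a linear profile.
second_diffs = [sum(c * t for c, t in zip(row, total)) for row in C]
print(C)
print(second_diffs)
```

Each entry of `second_diffs` measures how far three consecutive summed means depart from lying on a straight line; the formal test weighs these departures against their estimated covariance.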


6.13. (Two-way MANOVA without replications.) Consider the observations on two
responses, $x_1$ and $x_2$, displayed in the form of the following two-way table (note that
there is a single observation vector at each combination of factor levels):


Factor 2 
Level Level Level Level 
1 2 3 4 


EG 


Level 1 


Factor 1 Level 2 


Level] 3 


With no replications, the two-way MANOVA model is

\[ X_{\ell k} = \mu + \tau_\ell + \beta_k + e_{\ell k}; \qquad \sum_{\ell=1}^{g} \tau_\ell = \sum_{k=1}^{b} \beta_k = 0 \]

where the $e_{\ell k}$ are independent $N_p(0, \Sigma)$ random vectors.

(a) Decompose the observations for each of the two variables as

\[ x_{\ell k} = \bar{x} + (\bar{x}_{\ell\cdot} - \bar{x}) + (\bar{x}_{\cdot k} - \bar{x}) + (x_{\ell k} - \bar{x}_{\ell\cdot} - \bar{x}_{\cdot k} + \bar{x}) \]

similar to the arrays in Example 6.9. For each response, this decomposition will result
in several $3 \times 4$ matrices. Here $\bar{x}$ is the overall average, $\bar{x}_{\ell\cdot}$ is the average for the $\ell$th
level of factor 1, and $\bar{x}_{\cdot k}$ is the average for the $k$th level of factor 2.

(b) Regard the rows of the matrices in Part a as strung out in a single "long" vector, and
compute the sums of squares

\[ SS_{\text{tot}} = SS_{\text{mean}} + SS_{\text{fac1}} + SS_{\text{fac2}} + SS_{\text{res}} \]

and sums of cross products

\[ SCP_{\text{tot}} = SCP_{\text{mean}} + SCP_{\text{fac1}} + SCP_{\text{fac2}} + SCP_{\text{res}} \]

Consequently, obtain the matrices $SSP_{\text{cor}}$, $SSP_{\text{fac1}}$, $SSP_{\text{fac2}}$, and $SSP_{\text{res}}$ with degrees
of freedom $gb - 1$, $g - 1$, $b - 1$, and $(g - 1)(b - 1)$, respectively.
(c) Summarize the calculations in Part b in a MANOVA table. 


Hint: This MANOVA table is consistent with the two-way MANOVA table for comparing
factors and their interactions where $n = 1$. Note that, with $n = 1$, $SSP_{\text{res}}$ in the
general two-way MANOVA table is a zero matrix with zero degrees of freedom. The
matrix of interaction sum of squares and cross products now becomes the residual sum
of squares and cross products matrix.

(d) Given the summary in Part c, test for factor 1 and factor 2 main effects at the $\alpha = .05$
level.
Hint: Use the results in (6-67) and (6-69) with $gb(n - 1)$ replaced by $(g - 1)(b - 1)$.

Note: The tests require that $p \le (g - 1)(b - 1)$ so that $SSP_{\text{res}}$ will be positive definite
(with probability 1).

6.14. A replicate of the experiment in Exercise 6.13 yields the following data:


                      Factor 2
            Level 1   Level 2   Level 3   Level 4
Factor 1  Level 1
          Level 2
          Level 3

(a) Use these data to decompose each of the two measurements in the observation
vector as

\[ x_{\ell k} = \bar{x} + (\bar{x}_{\ell\cdot} - \bar{x}) + (\bar{x}_{\cdot k} - \bar{x}) + (x_{\ell k} - \bar{x}_{\ell\cdot} - \bar{x}_{\cdot k} + \bar{x}) \]

where $\bar{x}$ is the overall average, $\bar{x}_{\ell\cdot}$ is the average for the $\ell$th level of factor 1, and $\bar{x}_{\cdot k}$
is the average for the $k$th level of factor 2. Form the corresponding arrays for each of
the two responses.


(b) Combine the preceding data with the data in Exercise 6.13 and carry out the neces- 
sary calculations to complete the general two-way MANOVA table. 


(c) Given the results in Part b, test for interactions, and if the interactions do not
exist, test for factor 1 and factor 2 main effects. Use the likelihood ratio test with
$\alpha = .05$.

(d) If main effects, but no interactions, exist, examine the nature of the main effects by 
constructing Bonferroni simultaneous 95% confidence intervals for differences of 
the components of the factor effect parameters. 
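The decomposition used in Exercises 6.13 and 6.14 splits each observation into an overall mean, a factor 1 effect, a factor 2 effect, and a residual. The sketch below carries this out for one response arranged in a 3 x 4 table with a single observation per cell. It is a minimal illustration under stated assumptions: the data values are hypothetical, and the names `two_way_decomposition` and `sum_sq` are invented for the example.

```python
def two_way_decomposition(x):
    """Split a g x b table into mean, factor 1, factor 2, and residual arrays."""
    g, b = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (g * b)
    row_means = [sum(row) / b for row in x]
    col_means = [sum(x[l][k] for l in range(g)) / g for k in range(b)]
    mean = [[grand] * b for _ in range(g)]
    fac1 = [[row_means[l] - grand] * b for l in range(g)]
    fac2 = [[col_means[k] - grand for k in range(b)] for _ in range(g)]
    res = [[x[l][k] - row_means[l] - col_means[k] + grand for k in range(b)]
           for l in range(g)]
    return mean, fac1, fac2, res

def sum_sq(a):
    """Sum of squared entries, treating the array as one long vector."""
    return sum(v * v for row in a for v in row)

# Hypothetical single-response data: rows are factor 1 levels, columns factor 2 levels.
x = [[6.0, 4.0, 8.0, 2.0],
     [3.0, -3.0, 4.0, -4.0],
     [-3.0, -4.0, 3.0, -4.0]]

mean, fac1, fac2, res = two_way_decomposition(x)
ss_tot = sum_sq(x)
ss_mean, ss_fac1, ss_fac2, ss_res = (sum_sq(a) for a in (mean, fac1, fac2, res))
print(ss_tot, ss_mean, ss_fac1, ss_fac2, ss_res)
```

Because the four component arrays are mutually perpendicular when strung out as long vectors, their sums of squares add up to the total. The cross-product terms for two responses follow the same pattern with products of corresponding entries in place of squares.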

6.15. Refer to Example 6.13.

(a) Carry out approximate chi-square (likelihood ratio) tests for the factor 1 and factor 2
effects. Set $\alpha = .05$. Compare these results with the results for the exact F-tests given
in the example. Explain any differences.


(b) Using (6-70), construct simultaneous 95% confidence intervals for differences in the 
factor 1 effect parameters for pairs of the three responses. Interpret these intervals. 
Repeat these calculations for factor 2 effect parameters. 


The following exercises may require the use of a computer.

6.16. Four measures of the response stiffness on each of 30 boards are listed in Table 4.3 (see
Example 4.14). The measures, on a given board, are repeated in the sense that they were
made one after another. Assuming that the measures of stiffness arise from four
treatments, test for the equality of treatments in a repeated measures design context. Set
$\alpha = .05$. Construct a 95% (simultaneous) confidence interval for a contrast in the
mean levels representing a comparison of the dynamic measurements with the static
measurements.


6.17. The data in Table 6.8 were collected to test two psychological models of numerical
cognition. Does the processing of numbers depend on the way the numbers are presented
(words, Arabic digits)? Thirty-two subjects were required to make a series of
quick numerical judgments about two numbers presented as either two number
words (“two,” “four”) or two single Arabic digits (“2,” “4”). The subjects were asked
to respond “same” if the two numbers had the same numerical parity (both even or
both odd) and “different” if the two numbers had a different parity (one even, one
odd). Half of the subjects were assigned a block of Arabic digit trials, followed by a
block of number word trials, and half of the subjects received the blocks of trials
in the reverse order. Within each block, the order of “same” and “different” parity
trials was randomized for each subject. For each of the four combinations of parity and
format, the median reaction times for correct responses were recorded for each
subject. Here

$X_1$ = median reaction time for word format-different parity combination
$X_2$ = median reaction time for word format-same parity combination
$X_3$ = median reaction time for Arabic format-different parity combination
$X_4$ = median reaction time for Arabic format-same parity combination

Table 6.8 Number Parity Data (Median Times in Milliseconds)

WordDiff   WordSame   ArabicDiff   ArabicSame
 (x1)       (x2)        (x3)         (x4)
 869.0      860.5       691.0        601.0
 995.0      875.0       678.0        659.0
1056.0      930.5       833.0        826.0
1126.0      954.0       888.0        728.0
1044.0      909.0       865.0        839.0
 925.0      856.5      1059.5        797.0
1172.5      896.5       926.0        766.0
1408.5     1311.0       854.0        986.0
1028.0      887.0       915.0        735.0
1011.0      863.0       761.0        657.0
 726.0      674.0       663.0        583.0
 982.0      894.0       831.0        640.0
1225.0     1179.0      1037.0        905.5
 731.0      662.0       662.5        624.0
 975.5      872.5       814.0        735.0
1130.5      811.0       843.0        657.0
 945.0      909.0       867.5        754.0
 747.0      752.5       777.0        687.5
 656.5      659.5       572.0        539.0
 919.0      833.0       752.0        611.0
 751.0      744.0       683.0        553.0
 774.0      735.0       671.0        612.0
 941.0      931.0       901.5        700.0
 751.0      785.0       789.0        735.0
 767.0      737.5       724.0        639.0
 813.5      750.5       711.0        625.0
1289.5     1140.0       904.5        784.5
1096.5     1009.0      1076.0        983.0
1083.0      958.0       918.0        746.5
1114.0     1046.0      1081.0        796.0
 708.0      669.0       657.0        572.5
1201.0      925.0      1004.5        673.5

Source: Data courtesy of J. Carr.


(a) Test for treatment effects using a repeated measures design. Set $\alpha = .05$.

(b) Construct 95% (simultaneous) confidence intervals for the contrasts representing 
the number format effect, the parity type effect and the interaction effect. Interpret 
the resulting intervals. 


(c) The absence of interaction supports the M model of numerical cognition, while the
presence of interaction supports the C and C model of numerical cognition. Which
model is supported in this experiment?

(d) For each subject, construct three difference scores corresponding to the number format
contrast, the parity type contrast, and the interaction contrast. Is a multivariate
normal distribution a reasonable population model for these data? Explain.


6.18. Jolicoeur and Mosimann [12] studied the relationship of size and shape for painted turtles.
Table 6.9 contains their measurements on the carapaces of 24 female and 24 male
turtles.

(a) Test for equality of the two population mean vectors using $\alpha = .05$.

(b) If the hypothesis in Part a is rejected, find the linear combination of mean components
most responsible for rejecting $H_0$.

(c) Find simultaneous confidence intervals for the component mean differences.
Compare with the Bonferroni intervals.

Hint: You may wish to consider logarithmic transformations of the observations. 


6.19. In the first phase of a study of the cost of transporting milk from farms to dairy plants, a
survey was taken of firms engaged in milk transportation. Cost data on $X_1$ = fuel,
$X_2$ = repair, and $X_3$ = capital, all measured on a per-mile basis, are presented in
Table 6.10 on page 345 for $n_1 = 36$ gasoline and $n_2 = 23$ diesel trucks.

(a) Test for differences in the mean cost vectors. Set $\alpha = .01$.


(b) If the hypothesis of equal cost vectors is rejected in Part a, find the linear combina- 
tion of mean components most responsible for the rejection. 


(c) Construct 99% simultaneous confidence intervals for the pairs of mean components. 
Which costs, if any, appear to be quite different? 

(d) Comment on the validity of the assumptions used in your analysis. Note in particular 
that observations 9 and 21 for gasoline trucks have been identified as multivariate 
outliers. (See Exercise 5.22 and [2].) Repeat Part a with these observations deleted. 
Comment on the results. 


Table 6.9 Carapace Measurements (in Millimeters) for Painted Turtles

Female: Length ($x_1$), Width ($x_2$), Height ($x_3$)    Male: Length ($x_1$), Width ($x_2$), Height ($x_3$)


6.20. The tail lengths in millimeters ($x_1$) and wing lengths in millimeters ($x_2$) for 45 male
hook-billed kites are given in Table 6.11 on page 346. Similar measurements for female
hook-billed kites were given in Table 5.12.

(a) Plot the male hook-billed kite data as a scatter diagram, and (visually) check for outliers.
(Note, in particular, observation 31 with $x_1 = 284$.)

(b) Test for equality of mean vectors for the populations of male and female hook-billed
kites. Set $\alpha = .05$. If $H_0\colon \mu_1 - \mu_2 = 0$ is rejected, find the linear combination
most responsible for the rejection of $H_0$. (You may want to eliminate any
outliers found in Part a for the male hook-billed kite data before conducting this
test. Alternatively, you may want to interpret $x_1 = 284$ for observation 31 as a misprint
and conduct the test with $x_1 = 184$ for this observation. Does it make any
difference in this case how observation 31 for the male hook-billed kite data is
treated?)

(c) Determine the 95% confidence region for $\mu_1 - \mu_2$ and 95% simultaneous confidence
intervals for the components of $\mu_1 - \mu_2$.


(d) Are male or female birds generally larger? 




Table 6.10 Milk Transportation-Cost Data 


Gasoline trucks Diesel trucks 


x1 x2 x3 x1 x2 x3
16.44 12.43 11.23 8.50 12.26 9.11 

7.19 2.70 3.92 7.42 5.13 17.15 

9.92 1.35 9.75 10.28 3.32 11.23 

4.24 5.78 7.78 10.16 14.72 5.99 
11.20 5.05 10.67 12.79 4.17 29.28 
14.25 5.78 9.88 9.60 12.72 11.00 
13.50 10.98 10.60 6.47 8.89 19.00 
13.32 14.27 9.45 11.35 9.95 14.53 
29.11 15.09 3.28 9.15 2.94 13.68 
12.68 7.61 10.23 9.70 5.06 20.84 

7.51 5.80 8.13 9.77 17.86 35.18

9.90 3.63 9.13 11.61 11.75 17.00 
10.25 5.07 10.17 9.09 13.25 20.66 
11.11 6.15 7.61 8.53 10.14 17.45 
12.17 14.26 14.39 8.29 6.22 16.38 
10.24 2.59 6.09 15.90 12.90 19.09 
10.18 6.05 12.14 11.94 5.69 14.77 

8.88 2.70 12.23 9.54 16.77 22.66 
12.34 7.73 11.68 10.43 17.65 10.66 

8.51 14.02 12.01 10.87 21.52 28.47 
26.16 17.44 16.89 7.13 13.22 19.44 
12.95 8.24 7.18 11.88 12.18 21.20 
16.93 13.37 17.59 12.03 9.22 23.09 
14.70 10.78 14.58
10.32 5.16 17.00 

8.98 4.49 4.26 

9.70 11.59 6.83 
12.72 8.63 5.59 


9.49 2.16 6.23 
8.22 7.95 6.72 
13.70 11.22 4.91 
8.21 9.85 8.17 
15.86 11.42 13.06 
9.18 9.18 9.49 
12.49 4.67 11.94 
17.32 6.86 4.44 


Source: Data courtesy of M. Keaton. 


6.21. Using Moody's bond ratings, samples of 20 Aa (middle-high quality) corporate bonds
and 20 Baa (top-medium quality) corporate bonds were selected. For each of the corresponding
companies, the ratios

$X_1$ = current ratio (a measure of short-term liquidity)
$X_2$ = long-term interest rate (a measure of interest coverage)
$X_3$ = debt-to-equity ratio (a measure of financial risk or leverage)
$X_4$ = rate of return on equity (a measure of profitability)


Table 6.11 Male Hook-Billed Kite Data

$x_1$ (Tail length)    $x_2$ (Wing length)


Source: Data courtesy of S. Temple. 


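were recorded. The summary statistics are as follows:

Aa bond companies: $n_1 = 20$, $\bar{x}_1' = [2.287, 12.600, .347, 14.830]$, and

\[ S_1 = \begin{bmatrix} .459 & .254 & -.026 & -.244 \\ .254 & 27.465 & -.589 & -.267 \\ -.026 & -.589 & .030 & .102 \\ -.244 & -.267 & .102 & 6.854 \end{bmatrix} \]

Baa bond companies: $n_2 = 20$, $\bar{x}_2' = [2.404, 7.155, .524, 12.840]$,

\[ S_2 = \begin{bmatrix} .944 & -.089 & .002 & -.719 \\ -.089 & 16.432 & -.400 & 19.044 \\ .002 & -.400 & .024 & -.094 \\ -.719 & 19.044 & -.094 & 61.854 \end{bmatrix} \]

and

\[ S_{\text{pooled}} = \begin{bmatrix} .701 & .083 & -.012 & -.481 \\ .083 & 21.949 & -.494 & 9.388 \\ -.012 & -.494 & .027 & .004 \\ -.481 & 9.388 & .004 & 34.354 \end{bmatrix} \]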


(a) Does pooling appear reasonable here? Comment on the pooling procedure in this
case.

(b) Are the financial characteristics of firms with Aa bonds different from those with
Baa bonds? Using the pooled covariance matrix, test for the equality of mean
vectors. Set $\alpha = .05$.


(c) Calculate the linear combinations of mean components most responsible for rejecting
$H_0\colon \mu_1 - \mu_2 = 0$ in Part b.

(d) Bond rating companies are interested in a company's ability to satisfy its outstanding
debt obligations as they mature. Does it appear as if one or more of the foregoing
financial ratios might be useful in helping to classify a bond as "high" or "medium"
quality? Explain.

(e) Repeat part (b) assuming normal populations with unequal covariance matrices (see
(6-27), (6-28) and (6-29)). Does your conclusion change?
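The pooled matrix quoted in Exercise 6.21 can be reproduced from the two sample covariance matrices, which is a useful check before judging whether pooling is reasonable. The sketch below is illustrative only: the function name `pooled_covariance` is hypothetical, the matrices are entered as read from the exercise (with the sign of the $x_3$, $x_4$ entry of $S_1$ taken from symmetry), and the variance ratio at the end is just one informal diagnostic, not a formal test.

```python
def pooled_covariance(S1, n1, S2, n2):
    """Weighted average ((n1-1)S1 + (n2-1)S2) / (n1 + n2 - 2) of two covariance matrices."""
    w1, w2 = n1 - 1, n2 - 1
    p = len(S1)
    return [[(w1 * S1[i][j] + w2 * S2[i][j]) / (w1 + w2) for j in range(p)]
            for i in range(p)]

S1 = [[0.459, 0.254, -0.026, -0.244],
      [0.254, 27.465, -0.589, -0.267],
      [-0.026, -0.589, 0.030, 0.102],
      [-0.244, -0.267, 0.102, 6.854]]
S2 = [[0.944, -0.089, 0.002, -0.719],
      [-0.089, 16.432, -0.400, 19.044],
      [0.002, -0.400, 0.024, -0.094],
      [-0.719, 19.044, -0.094, 61.854]]

S_pooled = pooled_covariance(S1, 20, S2, 20)

# Informal diagnostic: ratio of the two sample variances of X4 (return on equity).
x4_variance_ratio = S2[3][3] / S1[3][3]

for row in S_pooled:
    print([round(v, 3) for v in row])
print(round(x4_variance_ratio, 2))
```

With equal sample sizes the pooled matrix is simply the entrywise average of $S_1$ and $S_2$, so large differences between corresponding entries are averaged away rather than resolved.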


6.22. Researchers interested in assessing pulmonary function in nonpathological populations
asked subjects to run on a treadmill until exhaustion. Samples of air were collected at
definite intervals and the gas contents analyzed. The results on 4 measures of oxygen
consumption for 25 males and 25 females are given in Table 6.12 on page 348. The
variables were

$X_1$ = resting volume O2 (L/min)
$X_2$ = resting volume O2 (mL/kg/min)
$X_3$ = maximum volume O2 (L/min)
$X_4$ = maximum volume O2 (mL/kg/min)


(a) Look for gender differences by testing for equality of group means. Use $\alpha = .05$. If
you reject $H_0\colon \mu_1 - \mu_2 = 0$, find the linear combination most responsible.

(b) Construct the 95% simultaneous confidence intervals for each $\mu_{1i} - \mu_{2i}$, $i = 1, 2, 3, 4$.
Compare with the corresponding Bonferroni intervals.


(c) The data in Table 6.12 were collected from graduate-student volunteers, and thus 
they do not represent a random sample. Comment on the possible implications of 
this information. 


6.23. Construct a one-way MANOVA using the width measurements from the iris data in
Table 11.5. Construct 95% simultaneous confidence intervals for differences in mean
components for the two responses for each pair of populations. Comment on the validity
of the assumption that $\Sigma_1 = \Sigma_2 = \Sigma_3$.


6.24. Researchers have suggested that a change in skull size over time is evidence of the
interbreeding of a resident population with immigrant populations. Four measurements were
made of male Egyptian skulls for three different time periods: period 1 is 4000 B.C., period 2
is 3300 B.C., and period 3 is 1850 B.C. The data are shown in Table 6.13 on page 349 (see the
skull data on the website www.prenhall.com/statistics). The measured variables are

$X_1$ = maximum breadth of skull (mm)
$X_2$ = basibregmatic height of skull (mm)
$X_3$ = basialveolar length of skull (mm)
$X_4$ = nasal height of skull (mm)

Construct a one-way MANOVA of the Egyptian skull data. Use $\alpha = .05$. Construct 95%
simultaneous confidence intervals to determine which mean components differ among
the populations represented by the three time periods. Are the usual MANOVA assumptions
realistic for these data? Explain.


6.25. Construct a one-way MANOVA of the crude-oil data listed in Table 11.7 on page 662.
Construct 95% simultaneous confidence intervals to determine which mean components
differ among the populations. (You may want to consider transformations of the
data to make them more closely conform to the usual MANOVA assumptions.)


Table 6.12 Oxygen-Consumption Data

Males and Females: $x_1$ = Resting O2 (L/min), $x_2$ = Resting O2 (mL/kg/min), $x_3$ = Maximum O2 (L/min), $x_4$ = Maximum O2 (mL/kg/min)

Source: Data courtesy of S. Rokicki.


Table 6.13 Egyptian Skull Data 


MaxBreath BasHeight BasLength NasHeight Time 
(x1) (x2) (x3) (x4) Period 
131 138 89 49 1 
125 131 92 48 1 
131 132 99 50 1 
119 132 96 44 1 
136 143 100 54 1 
138 137 89 56 1 
139 130 108 48 1 
125 136 93 48 1 
131 134 102 51 1 
134 134 99 51 1 
124 138 101 48 2 
133 134 97 48 2 
138 134 98 45 2 
148 129 104 51 2 
126 124 95 45 2 
135 136 98 52 2 
132 145 100 54 2 
133 130 102 48 2 
131 134 96 50 2 
133 125 9 46 2 
132 130 91 52 3 
133 131 100 50 3 
138 137 94 51 3 
130 127 99 45 3 
136 133 91 49 3 
134 123 95 52 3 
136 137 101 54 3 
133 131 96 49 3 
138 133 100 55 3 
138 133 91 46 3 

Source: Data courtesy of J. Jackson. 


6.26. A project was designed to investigate how consumers in Green Bay, Wisconsin, would 
react to an electrical time-of-use pricing scheme. The cost of electricity during peak 
periods for some customers was set at eight times the cost of electricity during 
off-peak hours. Hourly consumption (in kilowatt-hours) was measured on a hot summer
day in July and compared, for both the test group and the control group, with baseline 
consumption measured on a similar day before the experimental rates began. The 
responses, 


log(current consumption) — log(baseline consumption) 


for the hours ending 9 A.M., 11 A.M. (a peak hour), 1 P.M., and 3 P.M. (a peak hour) produced
the following summary statistics:

Test group: $n_1 = 28$, $\bar{x}_1' = [.153, -.231, -.322, -.339]$
Control group: $n_2 = 58$, $\bar{x}_2' = [.151, .180, .256, .257]$

and

\[ S_{\text{pooled}} = \begin{bmatrix} .804 & .355 & .228 & .232 \\ .355 & .722 & .233 & .199 \\ .228 & .233 & .592 & .239 \\ .232 & .199 & .239 & .479 \end{bmatrix} \]

Source: Data courtesy of Statistical Laboratory, University of Wisconsin. 


Perform a profile analysis. Does time-of-use pricing seem to make a difference in
electrical consumption? What is the nature of this difference, if any? Comment. (Use a
significance level of $\alpha = .05$ for any statistical tests.)


6.27. As part of the study of love and marriage in Example 6.14, a sample of husbands and
wives were asked to respond to these questions:
1. What is the level of passionate love you feel for your partner?
2. What is the level of passionate love that your partner feels for you?
3. What is the level of companionate love that you feel for your partner?
4. What is the level of companionate love that your partner feels for you?
The responses were recorded on the following 5-point scale.

None at all   Very little   Some   A great deal   Tremendous amount
     1             2          3          4                5

Thirty husbands and 30 wives gave the responses in Table 6.14, where $X_1$ = a 5-point-scale
response to Question 1, $X_2$ = a 5-point-scale response to Question 2, $X_3$ = a
5-point-scale response to Question 3, and $X_4$ = a 5-point-scale response to Question 4.


(a) Plot the mean vectors for husbands and wives as sample profiles. 


(b) Is the husband rating wife profile parallel to the wife rating husband profile? Test
for parallel profiles with $\alpha = .05$. If the profiles appear to be parallel, test for coincident
profiles at the same level of significance. Finally, if the profiles are coincident,
test for level profiles with $\alpha = .05$. What conclusion(s) can be drawn from this
analysis?

6.28. Two species of biting flies (genus Leptoconops) are so similar morphologically that for
many years they were thought to be the same. Biological differences such as sex ratios of
emerging flies and biting habits were found to exist. Do the taxonomic data listed in part
in Table 6.15 on page 352 and on the website www.prenhall.com/statistics indicate any
difference in the two species L. carteri and L. torrens? Test for the equality of the two
population mean vectors using $\alpha = .05$. If the hypothesis of equal mean vectors is rejected,
determine the mean components (or linear combinations of mean components) most
responsible for rejecting $H_0$. Justify your use of normal-theory methods for these data.

6.29. Using the data on bone mineral content in Table 1.8, investigate equality between the
dominant and nondominant bones.




Table 6.14 Spouse Data 
Husband rating wife Wife rating husband 


a 
ta 
Nn) 
5 
Ba 


42 


a] 
w 
& 
a 


x4 
5 
4 » 
5 
4 
5 
5 
4 
5 
5 
3 
5 
4 
4 
5 
5 
5 
4 
5 
4 
4 
4 
4 
5 
5 
3 
4 
4 
5 
3 
5 


HPAWAWNAMNWAAANAANAAWAAANARA AAW HWAARUNAND 
HAW AWMAWHAAHAUNAANAWAWANAAHANAAAWWWUNUNW 
AKBNHHHWANKAHAAALH ANNA HAUNWUANRERUKHAUN AYA 
AA WA A AWHOANANWA MAHWAH AAA WA WA WAA AAA 
HAHA AWA ANHWWA NAMA AHA HAUANAHA HA AWWHRUNBUNA 
AMBABAAMA HRA HDAAHAKAHAANAHAAUNHBAAUNAA 
ALBA ADAMANAD HAH ANAANARUANANASAAHAUANAY 


Source: Data courtesy of E. Hatfield. 


(a) Test using $\alpha = .05$.
(b) Construct 95% simultaneous confidence intervals for the mean differences. 


(c) Construct the Bonferroni 95% simultaneous intervals, and compare these with the 
intervals in Part b. 


6.30. Table 6.16 on page 353 contains the bone mineral contents, for the first 24 subjects in 
Table 1.8, 1 year after their participation in an experimental program. Compare the data 
from both tables to determine whether there has been bone loss. 


(a) Test using $\alpha = .05$.
(b) Construct 95% simultaneous confidence intervals for the mean differences. 


(c) Construct the Bonferroni 95% simultaneous intervals, and compare these with the 
intervals in Part b. 




Table 6.15 Biting-Fly Data 


x1 = wing length, x2 = wing width, x3 = third palp length, x4 = third palp width,
x5 = fourth palp length, x6 = length of antennal segment 12,
x7 = length of antennal segment 13

x1 x2 x3 x4 x5 x6 x7

85 41 31 13 25 9 8 
87 38 32 14 22 13 13 
94 44 36 15 27 8 9
92 43 32 17 28 9 9 
96 43 35 14 26 10 10 
91 44 36 12 24 9 9 
90 42 36 16 26 9 9 
92 43 36 17 26 9 9 
91 41 36 14 23 9 9 
87 38 35 11 24 9 10 
L.torrens : E : : : : : 
106 47 38 15 26 10 10 


105 50 40 16 33 12 11 
99 47 39 14 34 Po 7 


Source: Data courtesy of William Atchley.




Table 6.16 Mineral Content in Bones (After 1 Year) 


Subject Dominant Dominant Dominant 

number radius Radius humerus Humerus ulna Ulna 
1 1.027 1.051 2.268 2.246 869 .964 

2 857 817 1.718 1.710 602 689 

3 875 880 1.953 1.756 765 738 

4 873 698 1.668 1.443 761 698 

5 811 813 1.643 1.661 551 619 

6 .640 .734 1.396 1.378 153 515 

7 947 865 1.851 1.686 108 187 

8 886 806 1.742 1.815 687 715 

9 .991 923 1.931 1.776 844 656 

10 977 925 1.933 2.106 869 789 
il 825 826 1.609 1.651 654 726 

12 851 765 2.352 1.980 692 526 
13 .710 730 1.470 1.420 .670 580 

14 912 875 1.846 1.809 823 .773 
15 905 826 1.842 1.579 7146 729 

16 .756 727 1.747 1.860 656 506 
17 765 .764 1.923 1.941 693 .740 

18 932 914 2.190 1.997 883 785 
19 843 -782 1.242 1228 S77 .627 

20 879 .906 2.164 1.999 802 769 
21 673 537 1.573 1.330 540 498 
22 .949 900 2.130 2.159 804 .779 
23 463 637 1.041 1.265 570 634 
24 .716 .743 1.442 1.411 585 640 

Source: Data courtesy of Everett Smith. 


6.31. Peanuts are an important crop in parts of the southern United States. In an effort to
develop improved plants, crop scientists routinely compare varieties with respect to
several variables. The data for one two-factor experiment are given in Table 6.17 on page 354.
Three varieties (5, 6, and 8) were grown at two geographical locations (1, 2) and, in this
case, the three variables representing yield and the two important grade-grain
characteristics were measured. The three variables are

$X_1$ = Yield (plot weight)
$X_2$ = Sound mature kernels (weight in grams, maximum of 250 grams)
$X_3$ = Seed size (weight, in grams, of 100 seeds)

There were two replications of the experiment.

(a) Perform a two-factor MANOVA using the data in Table 6.17. Test for a location
effect, a variety effect, and a location-variety interaction. Use $\alpha = .05$.

(b) Analyze the residuals from Part a. Do the usual MANOVA assumptions appear to
be satisfied? Discuss.

(c) Using the results in Part a, can we conclude that the location and/or variety effects
are additive? If not, does the interaction effect show up for some variables, but not
for others? Check by running three separate univariate two-factor ANOVAs.




Table 6.17 Peanut Data

Factor 1   Factor 2   x1      x2         x3
Location   Variety    Yield   SdMatKer   SeedSize
1          5          195.3   153.1      51.4
1          5          194.3   167.7      53.7
2          5          189.7   139.5      55.5
2          5          180.4   121.1      44.4
1          6          203.0   156.8      49.8
1          6          195.9   166.0      45.8
2          6          202.7   166.1      60.4
2          6          197.6   161.8      54.1
1          8          193.5   164.5      57.8
1          8          187.0   165.1      58.6
2          8          201.5   166.8      65.0
2          8          200.0   173.8      67.2
Source: Data courtesy of Yolanda Lopez.


(d) Larger numbers correspond to better yield and grade-grain characteristics. Using
location 2, can we conclude that one variety is better than the other two for each
characteristic? Discuss your answer, using 95% Bonferroni simultaneous intervals for
pairs of varieties.


6.32. In one experiment involving remote sensing, the spectral reflectance of three species of
1-year-old seedlings was measured at various wavelengths during the growing season.
The seedlings were grown with two different levels of nutrient: the optimal level,
coded +, and a suboptimal level, coded −. The species of seedlings used were sitka
spruce (SS), Japanese larch (JL), and lodgepole pine (LP). Two of the variables
measured were

$X_1$ = percent spectral reflectance at wavelength 560 nm (green)
$X_2$ = percent spectral reflectance at wavelength 720 nm (near infrared)

The cell means (CM) for Julian day 235 for each combination of species and nutrient
level are as follows. These averages are based on four replications.


560CM    720CM    Species   Nutrient
10.35    25.93    SS        +
13.41    38.63    JL        +
 7.78    25.15    LP        +
10.40    24.25    SS        −
17.78    41.45    JL        −
10.40    29.20    LP        −


(a) Treating the cell means as individual observations, perform a two-way MANOVA.
Test for a species effect and a nutrient effect. Use $\alpha = .05$.

(b) Construct a two-way ANOVA for the 560CM observations and another two-way
ANOVA for the 720CM observations. Are these results consistent with the
MANOVA results in Part a? If not, can you explain any differences?




6.33. Refer to Exercise 6.32. The data in Table 6.18 are measurements on the variables

$X_1$ = percent spectral reflectance at wavelength 560 nm (green)
$X_2$ = percent spectral reflectance at wavelength 720 nm (near infrared)

for three species (sitka spruce [SS], Japanese larch [JL], and lodgepole pine [LP]) of
1-year-old seedlings taken at three different times (Julian day 150 [1], Julian day 235 [2],
and Julian day 320 [3]) during the growing season. The seedlings were all grown with the
optimal level of nutrient.

(a) Perform a two-factor MANOVA using the data in Table 6.18. Test for a species
effect, a time effect, and a species-time interaction. Use $\alpha = .05$.


Table 6.18 Spectral Reflectance Data 
560 nm 720 nm Species Time Replication 
9.33 19.14 SS 1 1 
8.74 19.55 SS 1 2 
9.31 19.24 SS 1 3 
8.27 16.37 SS 1 4
10.22 25.00 SS 2 1
10.13 25.32 SS 2 2
10.42 27.12 SS 2 3
10.62 26.28 SS 2 4 
15.25 38.89 SS 3 1 
16.22 36.67 SS 3 2
17.24 40.74 SS 3 3
12.77 67.50 SS 3 4
12.07 33.03 JL 1 1 
11.03 32.37 JL 1 2 
12.48 31.31 JL 1 3 
12.12 33.33 JL 1 4 
15.38 40.00 JL 2 1 
14.21 40.48 JL 2 2 
9.69 33.90 JL 2 3 
14.35 40.15 JL 2 4 
38.71 77.14 JL 3 1 
44.74 78.57 JL 3 2 
36.67 71.43 JL 3 3 
37.21 45.00 JL 3 4 
8.73 23.27 LP 1 1
7.94 20.87 LP 1 2 
8.37 22.16 LP 1 3 
7.86 21.78 LP 1 4 
8.45 26.32 LP 2 1 
6.79 22.73 LP 2 2 
8.34 26.67 LP 2 3 
7.54 24.87 LP 2 4 
14.04 44.44 LP 3 1 
13.51 37.93 LP 3 2 
13.33 37.93 LP 3 3 
12.77 60.87 LP 3 4 


Source: Data courtesy of Mairtin Mac Siurtain. 




(b) Do you think the usual MANOVA assumptions are satisfied for these data?
Discuss with reference to a residual analysis, and the possibility of correlated
observations over time.

(c) Foresters are particularly interested in the interaction of species and time. Does
interaction show up for one variable but not for the other? Check by running a
univariate two-factor ANOVA for each of the two responses.

(d) Can you think of another method of analyzing these data (or a different
experimental design) that would allow for a potential time trend in the spectral
reflectance numbers?


6.34. Refer to Example 6.15. 


(a) Plot the profiles, the components of $\bar{\mathbf{x}}_1$ versus time and those of $\bar{\mathbf{x}}_2$ versus time, on
the same graph. Comment on the comparison.


(b) Test that linear growth is adequate. Take $\alpha = .01$.


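6.35. Refer to Example 6.15 but treat all 31 subjects as a single group. The maximum
likelihood estimate of the $(q + 1) \times 1$ vector $\boldsymbol{\beta}$ is

\[ \hat{\boldsymbol{\beta}} = (\mathbf{B}'\mathbf{S}^{-1}\mathbf{B})^{-1}\mathbf{B}'\mathbf{S}^{-1}\bar{\mathbf{x}} \]

where $\mathbf{S}$ is the sample covariance matrix.
The estimated covariances of the maximum likelihood estimators are

\[ \widehat{\operatorname{Cov}}(\hat{\boldsymbol{\beta}}) = \frac{(n - 1)(n - 2)}{(n - 1 - p + q)(n - p + q)\,n}\,(\mathbf{B}'\mathbf{S}^{-1}\mathbf{B})^{-1} \]

Fit a quadratic growth curve to this single group and comment on the fit.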


6.36. Refer to Example 6.4. Given the summary information on electrical usage in this
example, use Box's M-test to test the hypothesis $H_0\colon \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$. Here $\boldsymbol{\Sigma}_1$ is the
covariance matrix for the two measures of usage for the population of Wisconsin homeowners
with air conditioning, and $\boldsymbol{\Sigma}_2$ is the electrical usage covariance matrix for the population
of Wisconsin homeowners without air conditioning. Set $\alpha = .05$.


6.37. Table 6.9 page 344 contains the carapace measurements for 24 female and 24 male
turtles. Use Box's M-test to test $H_0\colon \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$, where $\boldsymbol{\Sigma}_1$ is the population covariance
matrix for carapace measurements for female turtles, and $\boldsymbol{\Sigma}_2$ is the population
covariance matrix for carapace measurements for male turtles. Set $\alpha = .05$.


6.38. Table 11.7 page 662 contains the values of three trace elements and two measures of
hydrocarbons for crude oil samples taken from three groups (zones) of sandstone. Use
Box's M-test to test equality of population covariance matrices for the three sandstone
groups. Set $\alpha = .05$. Here there are $p = 5$ variables and you may wish to consider
transformations of the measurements on these variables to make them more nearly normal.


6.39. Anacondas are some of the largest snakes in the world. Jesus Ravis and his fellow
researchers capture a snake and measure its (i) snout vent length (cm), or the length from
the snout of the snake to its vent where it evacuates waste, and (ii) weight (kilograms). A
sample of these measurements is shown in Table 6.19.

(a) Test for equality of means between males and females using $\alpha = .05$. Apply the
large sample statistic.

(b) Is it reasonable to pool variances in this case? Explain.

(c) Find the 95% Bonferroni confidence intervals for the mean differences between
males and females on both length and weight.




Table 6.19 Anaconda Data 


Snout vent 
Length 


Snout vent 
length Weight 


Weight Gender Gender 


18.50 F 176.7 3.00 M 
82.50 F 259.5 9.75 M 
23.40 F 258.0 10.07 M 
33.50 F 229.8 7.50 M 
69.00 F 233.0 6.25 M 
54.00 F 237.5 9.85 M 
24.97 F 268.3 10.00 M 
56.75 F 222.5 9.00 M 
23.15 F 186.5 3.75 M 
29.51 F 238.8 9.75 M 
19.98 F 257.6 9.75 M 
24.00 F 172.0 3.00 M 
70.37 F 244.7 10.00 M 
15.50 F 224.7 7.25 M 
63.00 F 231.7 9.25 M 
39.00 F 235.9 7.50 M 
53.00 F 236.5 5.75 M 
15.75 F 247.4 7.75 M
44.00 F 223.0 5.75 M 
30.00 F 223.7 5.75 M 
34.00 F 212.5 7.65 M 
25.00 F 223.2 7.715 M 

9.25 F 225.0 5.84 M 
30.00 F 228.0 7.53 M 
15.25 F 215.6 5.75 M 
21.50 F 221.0 6.45 M 
57.00 F 236.7 6.49 M 
61.50 F 235.3 6.00 M 


Source: Data Courtesy of Jesus Ravis. 


6.40. Compare the male national track records in Table 8.6 with the female national track 
records in Table 1.9 using the results for the 100m, 200m, 400m, 800m and 1500m races. 
Treat the data as a random sample of size 64 of the twelve record values. 


(a) Test for equality of means between males and females using $\alpha = .05$. Explain why it
may be appropriate to analyze differences.

(b) Find the 95% Bonferroni confidence intervals for the mean differences between
males and females on all of the races.


6.41. When cell phone relay towers are not working properly, wireless providers can lose great
amounts of money, so it is important to be able to fix problems expeditiously. A first step
toward understanding the problems involved is to collect data from a designed
experiment involving three factors. A problem was initially classified as low or high severity,
simple or complex, and the engineer assigned was rated as relatively new (novice) or
expert (guru).

Two times were observed. The time to assess the problem and plan an attack and
the time to implement the solution were each measured in hours. The data are given in
Table 6.20.
Perform a MANOVA including appropriate confidence intervals for important effects.


Table 6.20 Fixing Breakdowns 


Problem Problem Engineer Problem Problem Total 
Severity Complexity Experience Assessment Implementation Resolution — 


Leve} Level Level Time Time 


Simple Novice 3.0 
Low Simple Novice 2.3 
Low Simple Guru 1.7 
Low Simple Guru 12 
Low Complex Novice 6.7 
Low Complex Novice TA 
Low Complex Guru 5.6 
Low Compiex Guru 45 
High Simple Novice 45 
High Simple Novice 4.7 
High Simple Guru 3.1 
High Simple Guru 3.0 
High Complex Novice 79 
High Complex Novice 6.9 
Complex Guru 5.0 
Complex Guru 


Source: Data courtesy of Dan Porter. 


References 


1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York:
John Wiley, 2003.

2. Bacon-Shone, J., and W. K. Fung. “A New Graphical Method for Detecting Single and
Multiple Outliers in Univariate and Multivariate Data.” Applied Statistics, 36, no. 2
(1987), 153-162.

3. Bartlett, M. S. “Properties of Sufficiency and Statistical Tests.” Proceedings of the Royal
Society of London (A), 160 (1937), 268-282.

4. Bartlett, M. S. “Further Aspects of the Theory of Multiple Regression.” Proceedings of
the Cambridge Philosophical Society, 34 (1938), 33-40.

5. Bartlett, M. S. “Multivariate Analysis.” Journal of the Royal Statistical Society
Supplement (B), 9 (1947), 176-197.

6. Bartlett, M. S. “A Note on the Multiplying Factors for Various $\chi^2$ Approximations.”
Journal of the Royal Statistical Society (B), 16 (1954), 296-298.

7. Box, G. E. P. “A General Distribution Theory for a Class of Likelihood Criteria.”
Biometrika, 36 (1949), 317-346.

8. Box, G. E. P. “Problems in the Analysis of Growth and Wear Curves.” Biometrics, 6
(1950), 362-389.

9. Box, G. E. P., and N. R. Draper. Evolutionary Operation: A Statistical Method for Process
Improvement. New York: John Wiley, 1969.

10. Box, G. E. P., W. G. Hunter, and J. S. Hunter. Statistics for Experimenters (2nd ed.).
New York: John Wiley, 2005.

11. Johnson, R. A., and G. K. Bhattacharyya. Statistics: Principles and Methods (5th ed.).
New York: John Wiley, 2005.

12. Jolicoeur, P., and J. E. Mosimann. “Size and Shape Variation in the Painted Turtle:
A Principal Component Analysis.” Growth, 24 (1960), 339-354.

13. Khattree, R., and D. N. Naik. Applied Multivariate Statistics with SAS® Software (2nd
ed.). Cary, NC: SAS Institute Inc., 1999.

14. Kshirsagar, A. M., and W. B. Smith. Growth Curves. New York: Marcel Dekker, 1995.

15. Krishnamoorthy, K., and J. Yu. “Modified Nel and Van der Merwe Test for the
Multivariate Behrens-Fisher Problem.” Statistics & Probability Letters, 66 (2004), 161-169.

16. Mardia, K. V. “The Effect of Nonnormality on some Multivariate Tests and Robustness
to Nonnormality in the Linear Model.” Biometrika, 58 (1971), 105-121.

17. Montgomery, D. C. Design and Analysis of Experiments (6th ed.). New York: John Wiley,
2005.

18. Morrison, D. F. Multivariate Statistical Methods (4th ed.). Belmont, CA: Brooks/Cole
Thomson Learning, 2005.

19. Nel, D. G., and C. A. Van der Merwe. “A Solution to the Multivariate Behrens-Fisher
Problem.” Communications in Statistics - Theory and Methods, 15 (1986), 3719-3735.

20. Pearson, E. S., and H. O. Hartley, eds. Biometrika Tables for Statisticians, vol. 1.
Cambridge, England: Cambridge University Press, 1972.

21. Potthoff, R. F., and S. N. Roy. “A Generalized Multivariate Analysis of Variance Model
Useful Especially for Growth Curve Problems.” Biometrika, 51 (1964), 313-326.

22. Scheffé, H. The Analysis of Variance. New York: John Wiley, 1959.

23. Tiku, M. L., and N. Balakrishnan. “Testing the Equality of Variance-Covariance Matrices
the Robust Way.” Communications in Statistics - Theory and Methods, 14, no. 12 (1985),
3033-3051.

24. Tiku, M. L., and M. Singh. “Robust Statistics for Testing Mean Vectors of Multivariate
Distributions.” Communications in Statistics - Theory and Methods, 11, no. 9 (1982),
985-1001.

25. Wilks, S. S. “Certain Generalizations in the Analysis of Variance.” Biometrika, 24 (1932),
471-494.


MULTIVARIATE LINEAR 
REGRESSION MODELS 




7.1 Introduction 


Regression analysis is the statistical methodology for predicting values of one or
more response (dependent) variables from a collection of predictor (independent)
variable values. It can also be used for assessing the effects of the predictor variables
on the responses. Unfortunately, the name regression, culled from the title of the
first paper on the subject by F. Galton [15], in no way reflects either the importance
or breadth of application of this methodology.

In this chapter, we first discuss the multiple regression model for the prediction
of a single response. This model is then generalized to handle the prediction
of several dependent variables. Our treatment must be somewhat terse, as a vast
literature exists on the subject. (If you are interested in pursuing regression
analysis, see the following books, in ascending order of difficulty: Abraham and
Ledolter [1], Bowerman and O'Connell [6], Neter, Wasserman, Kutner, and
Nachtsheim [20], Draper and Smith [13], Cook and Weisberg [11], Seber [23],
and Goldberger [16].) Our abbreviated treatment highlights the regression
assumptions and their consequences, alternative formulations of the regression
model, and the general applicability of regression techniques to seemingly
different situations.


7.2 The Classical Linear Regression Model 


Let $z_1, z_2, \ldots, z_r$ be $r$ predictor variables thought to be related to a response variable
$Y$. For example, with $r = 4$, we might have

    $Y$ = current market value of home

and

    $z_1$ = square feet of living area
    $z_2$ = location (indicator for zone of city)
    $z_3$ = appraised value last year
    $z_4$ = quality of construction (price per square foot)

The classical linear regression model states that $Y$ is composed of a mean, which
depends in a continuous manner on the $z_i$'s, and a random error $\varepsilon$, which accounts for
measurement error and the effects of other variables not explicitly considered in the
model. The values of the predictor variables recorded from the experiment or set by
the investigator are treated as fixed. The error (and hence the response) is viewed
as a random variable whose behavior is characterized by a set of distributional
assumptions.

Specifically, the linear regression model with a single response takes the form

\[ Y = \beta_0 + \beta_1 z_1 + \cdots + \beta_r z_r + \varepsilon \]

    [Response] = [mean (depending on $z_1, z_2, \ldots, z_r$)] + [error]

The term “linear” refers to the fact that the mean is a linear function of the
unknown parameters $\beta_0, \beta_1, \ldots, \beta_r$. The predictor variables may or may not enter the
model as first-order terms.


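With $n$ independent observations on $Y$ and the associated values of $z_i$, the
complete model becomes

\[
\begin{aligned}
Y_1 &= \beta_0 + \beta_1 z_{11} + \beta_2 z_{12} + \cdots + \beta_r z_{1r} + \varepsilon_1 \\
Y_2 &= \beta_0 + \beta_1 z_{21} + \beta_2 z_{22} + \cdots + \beta_r z_{2r} + \varepsilon_2 \\
    &\;\;\vdots \\
Y_n &= \beta_0 + \beta_1 z_{n1} + \beta_2 z_{n2} + \cdots + \beta_r z_{nr} + \varepsilon_n
\end{aligned}
\tag{7-1}
\]

where the error terms are assumed to have the following properties:

1. $E(\varepsilon_j) = 0$;
2. $\operatorname{Var}(\varepsilon_j) = \sigma^2$ (constant); and                                  (7-2)
3. $\operatorname{Cov}(\varepsilon_j, \varepsilon_k) = 0$, $j \neq k$.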


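In matrix notation, (7-1) becomes

\[
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix}
1 & z_{11} & z_{12} & \cdots & z_{1r} \\
1 & z_{21} & z_{22} & \cdots & z_{2r} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & z_{n1} & z_{n2} & \cdots & z_{nr}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_r \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\]

or

\[
\underset{(n \times 1)}{\mathbf{Y}} = \underset{(n \times (r+1))}{\mathbf{Z}}\;\underset{((r+1) \times 1)}{\boldsymbol{\beta}} + \underset{(n \times 1)}{\boldsymbol{\varepsilon}}
\]

and the specifications in (7-2) become
1. $E(\boldsymbol{\varepsilon}) = \mathbf{0}$; and
2. $\operatorname{Cov}(\boldsymbol{\varepsilon}) = E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}') = \sigma^2 \mathbf{I}$.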


Note that a one in the first column of the design matrix $\mathbf{Z}$ is the multiplier of the
constant term $\beta_0$. It is customary to introduce the artificial variable $z_{j0} = 1$, so that

\[ \beta_0 + \beta_1 z_{j1} + \cdots + \beta_r z_{jr} = \beta_0 z_{j0} + \beta_1 z_{j1} + \cdots + \beta_r z_{jr} \]

Each column of $\mathbf{Z}$ consists of the $n$ values of the corresponding predictor variable,
while the $j$th row of $\mathbf{Z}$ contains the values for all predictor variables on the $j$th trial.


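Classical Linear Regression Model

\[
\underset{(n \times 1)}{\mathbf{Y}} = \underset{(n \times (r+1))}{\mathbf{Z}}\;\underset{((r+1) \times 1)}{\boldsymbol{\beta}} + \underset{(n \times 1)}{\boldsymbol{\varepsilon}},
\qquad
E(\boldsymbol{\varepsilon}) = \underset{(n \times 1)}{\mathbf{0}} \quad \text{and} \quad \operatorname{Cov}(\boldsymbol{\varepsilon}) = \sigma^2 \underset{(n \times n)}{\mathbf{I}}
\tag{7-3}
\]

where $\boldsymbol{\beta}$ and $\sigma^2$ are unknown parameters and the design matrix $\mathbf{Z}$ has $j$th row
$[z_{j0}, z_{j1}, \ldots, z_{jr}]$.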


Although the error-term assumptions in (7-2) are very modest, we shall later need 
to add the assumption of joint normality for making confidence statements and 
testing hypotheses. 

We now provide some examples of the linear regression model. 


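Example 7.1 (Fitting a straight-line regression model) Determine the linear regression
model for fitting a straight line

    Mean response = $E(Y) = \beta_0 + \beta_1 z_1$

to the data

    $z_1$ | 0  1  2  3  4
    $y$   | 1  4  3  8  9

Before the responses $\mathbf{Y}' = [Y_1, Y_2, \ldots, Y_5]$ are observed, the errors
$\boldsymbol{\varepsilon}' = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_5]$ are random, and we can write

\[ \mathbf{Y} = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]

where

\[
\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_5 \end{bmatrix}, \quad
\mathbf{Z} = \begin{bmatrix} 1 & z_{11} \\ 1 & z_{21} \\ \vdots & \vdots \\ 1 & z_{51} \end{bmatrix}, \quad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad
\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_5 \end{bmatrix}
\]

The data for this model are contained in the observed response vector $\mathbf{y}$ and the
design matrix $\mathbf{Z}$, where

\[
\mathbf{y} = \begin{bmatrix} 1 \\ 4 \\ 3 \\ 8 \\ 9 \end{bmatrix}, \quad
\mathbf{Z} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}
\]

Note that we can handle a quadratic expression for the mean response by
introducing the term $\beta_2 z_2$, with $z_2 = z_1^2$. The linear regression model for the $j$th trial in
this latter case is

\[ Y_j = \beta_0 + \beta_1 z_{j1} + \beta_2 z_{j2} + \varepsilon_j \]

or

\[ Y_j = \beta_0 + \beta_1 z_{j1} + \beta_2 z_{j1}^2 + \varepsilon_j \qquad \blacksquare \]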

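Example 7.2 (The design matrix for one-way ANOVA as a regression model)
Determine the design matrix if the linear regression model is applied to the one-way
ANOVA situation in Example 6.6.

We create so-called dummy variables to handle the three population means:
$\mu_1 = \mu + \tau_1$, $\mu_2 = \mu + \tau_2$, and $\mu_3 = \mu + \tau_3$. We set

\[
z_1 = \begin{cases} 1 & \text{if the observation is from population 1} \\ 0 & \text{otherwise} \end{cases}
\qquad
z_2 = \begin{cases} 1 & \text{if the observation is from population 2} \\ 0 & \text{otherwise} \end{cases}
\]

\[
z_3 = \begin{cases} 1 & \text{if the observation is from population 3} \\ 0 & \text{otherwise} \end{cases}
\]

and $\beta_0 = \mu$, $\beta_1 = \tau_1$, $\beta_2 = \tau_2$, $\beta_3 = \tau_3$. Then

\[ Y_j = \beta_0 + \beta_1 z_{j1} + \beta_2 z_{j2} + \beta_3 z_{j3} + \varepsilon_j, \qquad j = 1, 2, \ldots, 8 \]

where we arrange the observations from the three populations in sequence. Thus, we
obtain the observed response vector and design matrix

\[
\underset{(8 \times 1)}{\mathbf{Y}} = \begin{bmatrix} 9 \\ 6 \\ 9 \\ 0 \\ 2 \\ 3 \\ 1 \\ 2 \end{bmatrix}, \qquad
\underset{(8 \times 4)}{\mathbf{Z}} = \begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1
\end{bmatrix}
\qquad \blacksquare
\]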

The construction of dummy variables, as in Example 7.2, allows the whole of 
analysis of variance to be treated within the multiple linear regression framework. 
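The dummy-variable construction of Example 7.2 can be sketched as follows. The `population` label vector is an illustrative device, not part of the text. Because the three dummy columns sum to the column of ones, this particular design matrix has rank 3 rather than 4, so the sketch uses `numpy.linalg.lstsq`, which returns one least squares solution without requiring full rank; the fitted values are the three sample means whichever solution is chosen.

```python
import numpy as np

# Responses from Example 7.2, listed population by population.
y = np.array([9.0, 6.0, 9.0, 0.0, 2.0, 3.0, 1.0, 2.0])

# Hypothetical label vector recording which population each observation came from.
population = np.array([1, 1, 1, 2, 2, 3, 3, 3])

# One dummy column per population: 1 if the observation belongs to it, 0 otherwise.
dummies = (population[:, None] == np.array([1, 2, 3])).astype(float)

# Design matrix: column of ones (for beta_0 = mu) followed by the three dummies.
Z = np.column_stack([np.ones(len(y)), dummies])

# The ones column equals the sum of the dummy columns, so Z is not of full rank.
rank = np.linalg.matrix_rank(Z)

# lstsq returns the minimum-norm least squares solution for a rank-deficient Z.
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
fitted = Z @ beta

print("rank of Z:", rank)
print("fitted values:", fitted)
```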




7.3 Least Squares Estimation 


One of the objectives of regression analysis is to develop an equation that will allow
the investigator to predict the response for given values of the predictor variables.
Thus, it is necessary to “fit” the model in (7-3) to the observed $y_j$ corresponding to
the known values $1, z_{j1}, \ldots, z_{jr}$. That is, we must determine the values for the
regression coefficients $\boldsymbol{\beta}$ and the error variance $\sigma^2$ consistent with the available data.

Let $\mathbf{b}$ be trial values for $\boldsymbol{\beta}$. Consider the difference $y_j - b_0 - b_1 z_{j1} - \cdots - b_r z_{jr}$
between the observed response $y_j$ and the value $b_0 + b_1 z_{j1} + \cdots + b_r z_{jr}$ that would be
expected if $\mathbf{b}$ were the “true” parameter vector. Typically, the differences
$y_j - b_0 - b_1 z_{j1} - \cdots - b_r z_{jr}$ will not be zero, because the response fluctuates (in a
manner characterized by the error term assumptions) about its expected value. The
method of least squares selects $\mathbf{b}$ so as to minimize the sum of the squares of the
differences:

\[
S(\mathbf{b}) = \sum_{j=1}^{n} (y_j - b_0 - b_1 z_{j1} - \cdots - b_r z_{jr})^2
= (\mathbf{y} - \mathbf{Z}\mathbf{b})'(\mathbf{y} - \mathbf{Z}\mathbf{b})
\tag{7-4}
\]

The coefficients $\mathbf{b}$ chosen by the least squares criterion are called least squares
estimates of the regression parameters $\boldsymbol{\beta}$. They will henceforth be denoted by $\hat{\boldsymbol{\beta}}$ to
emphasize their role as estimates of $\boldsymbol{\beta}$.

The coefficients $\hat{\boldsymbol{\beta}}$ are consistent with the data in the sense that they produce
estimated (fitted) mean responses, $\hat{\beta}_0 + \hat{\beta}_1 z_{j1} + \cdots + \hat{\beta}_r z_{jr}$, the sum of whose
squares of the differences from the observed $y_j$ is as small as possible. The deviations

\[
\hat{\varepsilon}_j = y_j - \hat{\beta}_0 - \hat{\beta}_1 z_{j1} - \cdots - \hat{\beta}_r z_{jr}, \qquad j = 1, 2, \ldots, n
\tag{7-5}
\]

are called residuals. The vector of residuals $\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}}$ contains the information
about the remaining unknown parameter $\sigma^2$. (See Result 7.2.)


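Result 7.1. Let $\mathbf{Z}$ have full rank $r + 1 \le n$.¹ The least squares estimate of $\boldsymbol{\beta}$ in
(7-3) is given by

\[ \hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} \]

Let $\hat{\mathbf{y}} = \mathbf{Z}\hat{\boldsymbol{\beta}} = \mathbf{H}\mathbf{y}$ denote the fitted values of $\mathbf{y}$, where $\mathbf{H} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$ is called the
“hat” matrix. Then the residuals

\[ \hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}} = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{y} = (\mathbf{I} - \mathbf{H})\mathbf{y} \]

satisfy $\mathbf{Z}'\hat{\boldsymbol{\varepsilon}} = \mathbf{0}$ and $\hat{\mathbf{y}}'\hat{\boldsymbol{\varepsilon}} = 0$. Also, the

\[
\text{residual sum of squares} = \sum_{j=1}^{n} (y_j - \hat{\beta}_0 - \hat{\beta}_1 z_{j1} - \cdots - \hat{\beta}_r z_{jr})^2 = \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}
= \mathbf{y}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{y} = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{Z}\hat{\boldsymbol{\beta}}
\]

¹If $\mathbf{Z}$ is not full rank, $(\mathbf{Z}'\mathbf{Z})^{-1}$ is replaced by $(\mathbf{Z}'\mathbf{Z})^{-}$, a generalized inverse of $\mathbf{Z}'\mathbf{Z}$. (See
Exercise 7.6.)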
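The quantities in Result 7.1 can be checked numerically. The sketch below assumes a full-rank design matrix and, purely for illustration, uses the data of Example 7.1 with the quadratic design; the helper name `least_squares` is hypothetical. It solves the normal equations with `numpy.linalg.solve` rather than forming the inverse explicitly, which is a numerical convenience and not a different estimator.

```python
import numpy as np

def least_squares(Z, y):
    """Return (beta_hat, H, residuals) for a full-rank design matrix Z."""
    ZtZ = Z.T @ Z
    beta_hat = np.linalg.solve(ZtZ, Z.T @ y)   # (Z'Z)^{-1} Z'y
    H = Z @ np.linalg.solve(ZtZ, Z.T)          # hat matrix Z (Z'Z)^{-1} Z'
    residuals = y - H @ y                      # (I - H) y
    return beta_hat, H, residuals

# Illustrative data: Example 7.1 with a quadratic mean response.
z1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 4.0, 3.0, 8.0, 9.0])
Z = np.column_stack([np.ones_like(z1), z1, z1**2])

beta_hat, H, resid = least_squares(Z, y)
y_hat = H @ y

# Properties asserted in Result 7.1, checked up to rounding error.
print("Z' resid = 0:        ", np.allclose(Z.T @ resid, 0.0))
print("y_hat' resid = 0:    ", np.isclose(y_hat @ resid, 0.0))
print("H is idempotent:     ", np.allclose(H @ H, H))
print("RSS = y'y - y'Z beta:", np.isclose(resid @ resid, y @ y - y @ Z @ beta_hat))
```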


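Proof. Let $\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}$ as asserted. Then $\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}} =
[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{y}$. The matrix $[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']$ satisfies

1. $[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']' = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']$ (symmetric);
2. $[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'][\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']$
   $= \mathbf{I} - 2\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}' + \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$                    (7-6)
   $= [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']$ (idempotent);
3. $\mathbf{Z}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'] = \mathbf{Z}' - \mathbf{Z}' = \mathbf{0}$.

Consequently, $\mathbf{Z}'\hat{\boldsymbol{\varepsilon}} = \mathbf{Z}'(\mathbf{y} - \hat{\mathbf{y}}) = \mathbf{Z}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{y} = \mathbf{0}$, so $\hat{\mathbf{y}}'\hat{\boldsymbol{\varepsilon}} = \hat{\boldsymbol{\beta}}'\mathbf{Z}'\hat{\boldsymbol{\varepsilon}} = 0$.
Additionally, $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = \mathbf{y}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'][\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{y} = \mathbf{y}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{y}
= \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{Z}\hat{\boldsymbol{\beta}}$. To verify the expression for $\hat{\boldsymbol{\beta}}$, we write

\[ \mathbf{y} - \mathbf{Z}\mathbf{b} = \mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}} + \mathbf{Z}\hat{\boldsymbol{\beta}} - \mathbf{Z}\mathbf{b} = \mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}} + \mathbf{Z}(\hat{\boldsymbol{\beta}} - \mathbf{b}) \]

so

\[
\begin{aligned}
S(\mathbf{b}) &= (\mathbf{y} - \mathbf{Z}\mathbf{b})'(\mathbf{y} - \mathbf{Z}\mathbf{b}) \\
&= (\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}}) + (\hat{\boldsymbol{\beta}} - \mathbf{b})'\mathbf{Z}'\mathbf{Z}(\hat{\boldsymbol{\beta}} - \mathbf{b}) + 2(\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}})'\mathbf{Z}(\hat{\boldsymbol{\beta}} - \mathbf{b}) \\
&= (\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}}) + (\hat{\boldsymbol{\beta}} - \mathbf{b})'\mathbf{Z}'\mathbf{Z}(\hat{\boldsymbol{\beta}} - \mathbf{b})
\end{aligned}
\]

since $(\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}})'\mathbf{Z} = \hat{\boldsymbol{\varepsilon}}'\mathbf{Z} = \mathbf{0}'$. The first term in $S(\mathbf{b})$ does not depend on $\mathbf{b}$ and the
second is the squared length of $\mathbf{Z}(\hat{\boldsymbol{\beta}} - \mathbf{b})$. Because $\mathbf{Z}$ has full rank, $\mathbf{Z}(\hat{\boldsymbol{\beta}} - \mathbf{b}) \neq \mathbf{0}$
if $\hat{\boldsymbol{\beta}} \neq \mathbf{b}$, so the minimum sum of squares is unique and occurs for $\mathbf{b} = \hat{\boldsymbol{\beta}} =
(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}$. Note that $(\mathbf{Z}'\mathbf{Z})^{-1}$ exists since $\mathbf{Z}'\mathbf{Z}$ has rank $r + 1 \le n$. (If $\mathbf{Z}'\mathbf{Z}$ is not
of full rank, $\mathbf{Z}'\mathbf{Z}\mathbf{a} = \mathbf{0}$ for some $\mathbf{a} \neq \mathbf{0}$, but then $\mathbf{a}'\mathbf{Z}'\mathbf{Z}\mathbf{a} = 0$ or $\mathbf{Z}\mathbf{a} = \mathbf{0}$, which
contradicts $\mathbf{Z}$ having full rank $r + 1$.) ■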


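Result 7.1 shows how the least squares estimates $\hat{\boldsymbol{\beta}}$ and the residuals $\hat{\boldsymbol{\varepsilon}}$ can be
obtained from the design matrix $\mathbf{Z}$ and responses $\mathbf{y}$ by simple matrix operations.


Example 7.3 (Calculating the least squares estimates, the residuals, and the residual
sum of squares) Calculate the least squares estimates $\hat{\boldsymbol{\beta}}$, the residuals $\hat{\boldsymbol{\varepsilon}}$, and the
residual sum of squares for a straight-line model

\[ Y_j = \beta_0 + \beta_1 z_{j1} + \varepsilon_j \]

fit to the data

    $z_1$ | 0  1  2  3  4
    $y$   | 1  4  3  8  9

We have

\[
\mathbf{Z}' = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 & 4 \end{bmatrix}, \quad
\mathbf{y} = \begin{bmatrix} 1 \\ 4 \\ 3 \\ 8 \\ 9 \end{bmatrix}, \quad
\mathbf{Z}'\mathbf{Z} = \begin{bmatrix} 5 & 10 \\ 10 & 30 \end{bmatrix}, \quad
(\mathbf{Z}'\mathbf{Z})^{-1} = \begin{bmatrix} .6 & -.2 \\ -.2 & .1 \end{bmatrix}, \quad
\mathbf{Z}'\mathbf{y} = \begin{bmatrix} 25 \\ 70 \end{bmatrix}
\]

Consequently,

\[
\hat{\boldsymbol{\beta}} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}
= \begin{bmatrix} .6 & -.2 \\ -.2 & .1 \end{bmatrix} \begin{bmatrix} 25 \\ 70 \end{bmatrix}
= \begin{bmatrix} 1 \\ 2 \end{bmatrix}
\]

and the fitted equation is

\[ \hat{y} = 1 + 2z \]

The vector of fitted (predicted) values is

\[
\hat{\mathbf{y}} = \mathbf{Z}\hat{\boldsymbol{\beta}} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix}
= \begin{bmatrix} 1 \\ 3 \\ 5 \\ 7 \\ 9 \end{bmatrix}
\]

so

\[
\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}} = \begin{bmatrix} 1 \\ 4 \\ 3 \\ 8 \\ 9 \end{bmatrix} - \begin{bmatrix} 1 \\ 3 \\ 5 \\ 7 \\ 9 \end{bmatrix}
= \begin{bmatrix} 0 \\ 1 \\ -2 \\ 1 \\ 0 \end{bmatrix}
\]

The residual sum of squares is

\[
\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = \begin{bmatrix} 0 & 1 & -2 & 1 & 0 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \\ -2 \\ 1 \\ 0 \end{bmatrix}
= 0^2 + 1^2 + (-2)^2 + 1^2 + 0^2 = 6 \qquad \blacksquare
\]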
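The arithmetic of Example 7.3 can be reproduced step by step with NumPy. This is a sketch that mirrors the hand calculation, including the explicit inverse, so each intermediate matrix can be compared with the one in the example; the variable names are illustrative.

```python
import numpy as np

# Design matrix and responses from Example 7.3.
Z = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 4.0, 3.0, 8.0, 9.0])

ZtZ = Z.T @ Z                  # [[5, 10], [10, 30]]
ZtZ_inv = np.linalg.inv(ZtZ)   # [[.6, -.2], [-.2, .1]]
Zty = Z.T @ y                  # [25, 70]

beta_hat = ZtZ_inv @ Zty       # least squares estimates
y_hat = Z @ beta_hat           # fitted values
resid = y - y_hat              # residuals
rss = resid @ resid            # residual sum of squares

print("beta_hat:", beta_hat)
print("fitted:  ", y_hat)
print("resid:   ", resid)
print("RSS:     ", rss)
```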


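Sum-of-Squares Decomposition

According to Result 7.1, $\hat{\mathbf{y}}'\hat{\boldsymbol{\varepsilon}} = 0$, so the total response sum of squares $\mathbf{y}'\mathbf{y} = \sum_{j=1}^{n} y_j^2$
satisfies

\[
\mathbf{y}'\mathbf{y} = (\hat{\mathbf{y}} + \mathbf{y} - \hat{\mathbf{y}})'(\hat{\mathbf{y}} + \mathbf{y} - \hat{\mathbf{y}}) = (\hat{\mathbf{y}} + \hat{\boldsymbol{\varepsilon}})'(\hat{\mathbf{y}} + \hat{\boldsymbol{\varepsilon}}) = \hat{\mathbf{y}}'\hat{\mathbf{y}} + \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}
\tag{7-7}
\]

Since the first column of $\mathbf{Z}$ is $\mathbf{1}$, the condition $\mathbf{Z}'\hat{\boldsymbol{\varepsilon}} = \mathbf{0}$ includes the requirement
$0 = \mathbf{1}'\hat{\boldsymbol{\varepsilon}} = \sum_{j=1}^{n} \hat{\varepsilon}_j = \sum_{j=1}^{n} y_j - \sum_{j=1}^{n} \hat{y}_j$, or $\bar{y} = \bar{\hat{y}}$. Subtracting $n\bar{y}^2 = n(\bar{\hat{y}})^2$ from both
sides of the decomposition in (7-7), we obtain the basic decomposition of the sum of
squares about the mean:

\[ \mathbf{y}'\mathbf{y} - n\bar{y}^2 = \hat{\mathbf{y}}'\hat{\mathbf{y}} - n(\bar{\hat{y}})^2 + \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} \]

or

\[
\sum_{j=1}^{n} (y_j - \bar{y})^2 = \sum_{j=1}^{n} (\hat{y}_j - \bar{y})^2 + \sum_{j=1}^{n} \hat{\varepsilon}_j^2
\tag{7-8}
\]

    (total sum of squares about mean) = (regression sum of squares) + (residual (error) sum of squares)

The preceding sum of squares decomposition suggests that the quality of the model's
fit can be measured by the coefficient of determination

\[
R^2 = 1 - \frac{\sum_{j=1}^{n} \hat{\varepsilon}_j^2}{\sum_{j=1}^{n} (y_j - \bar{y})^2}
= \frac{\sum_{j=1}^{n} (\hat{y}_j - \bar{y})^2}{\sum_{j=1}^{n} (y_j - \bar{y})^2}
\tag{7-9}
\]

The quantity $R^2$ gives the proportion of the total variation in the $y_j$'s “explained”
by, or attributable to, the predictor variables $z_1, z_2, \ldots, z_r$. Here $R^2$ (or the multiple
correlation coefficient $R = +\sqrt{R^2}$) equals 1 if the fitted equation passes through all
the data points, so that $\hat{\varepsilon}_j = 0$ for all $j$. At the other extreme, $R^2$ is 0 if $\hat{\beta}_0 = \bar{y}$ and
$\hat{\beta}_1 = \hat{\beta}_2 = \cdots = \hat{\beta}_r = 0$. In this case, the predictor variables $z_1, z_2, \ldots, z_r$ have no
influence on the response.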
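As an illustration of (7-8) and (7-9), the sketch below computes the three sums of squares for the straight-line fit to the data of Example 7.3. The variable names are illustrative. For these data the decomposition is 46 = 40 + 6, so $R^2 = 40/46 \approx .87$.

```python
import numpy as np

# Straight-line design matrix and responses from Example 7.3.
Z = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 4.0, 3.0, 8.0, 9.0])

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
y_hat = Z @ beta_hat
resid = y - y_hat
y_bar = y.mean()

total_ss = np.sum((y - y_bar) ** 2)           # total sum of squares about the mean
regression_ss = np.sum((y_hat - y_bar) ** 2)  # regression sum of squares
residual_ss = np.sum(resid ** 2)              # residual (error) sum of squares

# Coefficient of determination, computed both ways shown in (7-9).
r_squared = 1.0 - residual_ss / total_ss
r_squared_alt = regression_ss / total_ss

print(total_ss, "=", regression_ss, "+", residual_ss)
print("R^2 =", r_squared)
```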


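Geometry of Least Squares

A geometrical interpretation of the least squares technique highlights the nature of
the concept. According to the classical linear regression model,

\[
\text{Mean response vector} = E(\mathbf{Y}) = \mathbf{Z}\boldsymbol{\beta}
= \beta_0 \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}
+ \beta_1 \begin{bmatrix} z_{11} \\ z_{21} \\ \vdots \\ z_{n1} \end{bmatrix}
+ \cdots
+ \beta_r \begin{bmatrix} z_{1r} \\ z_{2r} \\ \vdots \\ z_{nr} \end{bmatrix}
\]

Thus, $E(\mathbf{Y})$ is a linear combination of the columns of $\mathbf{Z}$. As $\boldsymbol{\beta}$ varies, $\mathbf{Z}\boldsymbol{\beta}$ spans the
model plane of all linear combinations. Usually, the observation vector $\mathbf{y}$ will not lie
in the model plane, because of the random error $\boldsymbol{\varepsilon}$; that is, $\mathbf{y}$ is not (exactly) a linear
combination of the columns of $\mathbf{Z}$. Recall that

\[
\underset{\text{(response vector)}}{\mathbf{Y}} = \underset{\text{(vector in model plane)}}{\mathbf{Z}\boldsymbol{\beta}} + \underset{\text{(error vector)}}{\boldsymbol{\varepsilon}}
\]

Figure 7.1 Least squares as a projection for $n = 3$, $r = 1$.

Once the observations become available, the least squares solution is derived
from the deviation vector

\[ \mathbf{y} - \mathbf{Z}\mathbf{b} = (\text{observation vector}) - (\text{vector in model plane}) \]

The squared length $(\mathbf{y} - \mathbf{Z}\mathbf{b})'(\mathbf{y} - \mathbf{Z}\mathbf{b})$ is the sum of squares $S(\mathbf{b})$. As illustrated in
Figure 7.1, $S(\mathbf{b})$ is as small as possible when $\mathbf{b}$ is selected such that $\mathbf{Z}\mathbf{b}$ is the point in
the model plane closest to $\mathbf{y}$. This point occurs at the tip of the perpendicular
projection of $\mathbf{y}$ on the plane. That is, for the choice $\mathbf{b} = \hat{\boldsymbol{\beta}}$, $\hat{\mathbf{y}} = \mathbf{Z}\hat{\boldsymbol{\beta}}$ is the projection of
$\mathbf{y}$ on the plane consisting of all linear combinations of the columns of $\mathbf{Z}$. The residual
vector $\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}}$ is perpendicular to that plane. This geometry holds even when $\mathbf{Z}$ is
not of full rank.

When $\mathbf{Z}$ has full rank, the projection operation is expressed analytically as
multiplication by the matrix $\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$. To see this, we use the spectral
decomposition (2-16) to write

\[ \mathbf{Z}'\mathbf{Z} = \lambda_1 \mathbf{e}_1\mathbf{e}_1' + \lambda_2 \mathbf{e}_2\mathbf{e}_2' + \cdots + \lambda_{r+1} \mathbf{e}_{r+1}\mathbf{e}_{r+1}' \]

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{r+1} > 0$ are the eigenvalues of $\mathbf{Z}'\mathbf{Z}$ and $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_{r+1}$ are
the corresponding eigenvectors. If $\mathbf{Z}$ is of full rank,

\[ (\mathbf{Z}'\mathbf{Z})^{-1} = \frac{1}{\lambda_1}\mathbf{e}_1\mathbf{e}_1' + \frac{1}{\lambda_2}\mathbf{e}_2\mathbf{e}_2' + \cdots + \frac{1}{\lambda_{r+1}}\mathbf{e}_{r+1}\mathbf{e}_{r+1}' \]

Consider $\mathbf{q}_i = \lambda_i^{-1/2}\mathbf{Z}\mathbf{e}_i$, which is a linear combination of the columns of $\mathbf{Z}$. Then
$\mathbf{q}_i'\mathbf{q}_k = \lambda_i^{-1/2}\lambda_k^{-1/2}\mathbf{e}_i'\mathbf{Z}'\mathbf{Z}\mathbf{e}_k = \lambda_i^{-1/2}\lambda_k^{-1/2}\mathbf{e}_i'\lambda_k\mathbf{e}_k = 0$ if $i \neq k$ or 1 if $i = k$. That is, the $r + 1$
vectors $\mathbf{q}_i$ are mutually perpendicular and have unit length. Their linear
combinations span the space of all linear combinations of the columns of $\mathbf{Z}$. Moreover,

\[ \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}' = \sum_{i=1}^{r+1} \lambda_i^{-1}\mathbf{Z}\mathbf{e}_i\mathbf{e}_i'\mathbf{Z}' = \sum_{i=1}^{r+1} \mathbf{q}_i\mathbf{q}_i' \]

According to Result 2A.2 and Definition 2A.12, the projection of $\mathbf{y}$ on a linear
combination of $\{\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_{r+1}\}$ is

\[ \sum_{i=1}^{r+1} (\mathbf{q}_i'\mathbf{y})\,\mathbf{q}_i = \Bigl(\sum_{i=1}^{r+1} \mathbf{q}_i\mathbf{q}_i'\Bigr)\mathbf{y} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} = \mathbf{Z}\hat{\boldsymbol{\beta}} \]

Thus, multiplication by $\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$ projects a vector onto the space spanned by the
columns of $\mathbf{Z}$.²

Similarly, $[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']$ is the matrix for the projection of $\mathbf{y}$ on the plane
perpendicular to the plane spanned by the columns of $\mathbf{Z}$.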
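The spectral argument can be traced numerically. The sketch below, which assumes a full-rank design matrix and uses the straight-line data of Example 7.3 for illustration, builds the vectors $\mathbf{q}_i$ from the eigen-decomposition of $\mathbf{Z}'\mathbf{Z}$ and compares $\sum \mathbf{q}_i\mathbf{q}_i'$ with the hat matrix computed directly. Note that `numpy.linalg.eigh` returns eigenvalues in ascending order, the reverse of the ordering used in the text; the sum of outer products does not depend on the ordering.

```python
import numpy as np

# Straight-line design matrix and responses from Example 7.3.
Z = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 4.0, 3.0, 8.0, 9.0])

# Eigenvalues (ascending) and orthonormal eigenvectors (columns) of Z'Z.
lam, E = np.linalg.eigh(Z.T @ Z)

# Column i of Q is q_i = lambda_i^{-1/2} Z e_i.
Q = (Z @ E) / np.sqrt(lam)

# Sum of the outer products q_i q_i' versus the hat matrix computed directly.
H_spectral = Q @ Q.T
H_direct = Z @ np.linalg.inv(Z.T @ Z) @ Z.T

# Projection of y: sum over i of (q_i' y) q_i.
projection = Q @ (Q.T @ y)

print("q_i orthonormal:      ", np.allclose(Q.T @ Q, np.eye(Z.shape[1])))
print("sum q_i q_i' equals H:", np.allclose(H_spectral, H_direct))
print("projection of y:      ", projection)
```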


Sampling Properties of Classical Least Squares Estimators 


The least squares estimator $\hat{\boldsymbol{\beta}}$ and the residuals $\hat{\boldsymbol{\varepsilon}}$ have the sampling properties
detailed in the next result.


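Result 7.2. Under the general linear regression model in (7-3), the least squares
estimator $\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y}$ has

$$E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta} \quad \text{and} \quad \operatorname{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{Z}'\mathbf{Z})^{-1}$$

The residuals $\hat{\boldsymbol{\varepsilon}}$ have the properties

$$E(\hat{\boldsymbol{\varepsilon}}) = \mathbf{0} \quad \text{and} \quad \operatorname{Cov}(\hat{\boldsymbol{\varepsilon}}) = \sigma^2[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'] = \sigma^2[\mathbf{I} - \mathbf{H}]$$

Also, $E(\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}) = (n - r - 1)\sigma^2$, so defining

$$s^2 = \frac{\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}}{n - (r + 1)} = \frac{\mathbf{Y}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{Y}}{n - r - 1} = \frac{\mathbf{Y}'[\mathbf{I} - \mathbf{H}]\mathbf{Y}}{n - r - 1}$$

we have

$$E(s^2) = \sigma^2$$

Moreover, $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\varepsilon}}$ are uncorrelated.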


Proof. (See webpage: www.prenhall.com/statistics) ■


The least squares estimator $\hat{\boldsymbol{\beta}}$ possesses a minimum variance property that was
first established by Gauss. The following result concerns "best" estimators of linear
parametric functions of the form $\mathbf{c}'\boldsymbol{\beta} = c_0\beta_0 + c_1\beta_1 + \cdots + c_r\beta_r$ for any $\mathbf{c}$.


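Result 7.3 (Gauss'³ least squares theorem). Let $\mathbf{Y} = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $E(\boldsymbol{\varepsilon}) = \mathbf{0}$,
$\operatorname{Cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$, and $\mathbf{Z}$ has full rank $r + 1$. For any $\mathbf{c}$, the estimator

$$\mathbf{c}'\hat{\boldsymbol{\beta}} = c_0\hat{\beta}_0 + c_1\hat{\beta}_1 + \cdots + c_r\hat{\beta}_r$$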


²If $\mathbf{Z}$ is not of full rank, we can use the generalized inverse $(\mathbf{Z}'\mathbf{Z})^{-} = \sum_{i=1}^{r_1+1} \lambda_i^{-1}\mathbf{e}_i\mathbf{e}_i'$, where
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{r_1+1} > 0 = \lambda_{r_1+2} = \cdots = \lambda_{r+1}$, as described in Exercise 7.6. Then $\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}' = \sum_{i=1}^{r_1+1} \mathbf{q}_i\mathbf{q}_i'$
has rank $r_1 + 1$ and generates the unique projection of $\mathbf{y}$ on the space spanned by the linearly
independent columns of $\mathbf{Z}$. This is true for any choice of the generalized inverse. (See [23].)
³Much later, Markov proved a less general result, which misled many writers into attaching his
name to this theorem.




of $\mathbf{c}'\boldsymbol{\beta}$ has the smallest possible variance among all linear estimators of the form

$$\mathbf{a}'\mathbf{Y} = a_1Y_1 + a_2Y_2 + \cdots + a_nY_n$$

that are unbiased for $\mathbf{c}'\boldsymbol{\beta}$.




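Proof. For any fixed $\mathbf{c}$, let $\mathbf{a}'\mathbf{Y}$ be any unbiased estimator of $\mathbf{c}'\boldsymbol{\beta}$. Then
$E(\mathbf{a}'\mathbf{Y}) = \mathbf{c}'\boldsymbol{\beta}$, whatever the value of $\boldsymbol{\beta}$. Also, by assumption, $E(\mathbf{a}'\mathbf{Y}) =
E(\mathbf{a}'\mathbf{Z}\boldsymbol{\beta} + \mathbf{a}'\boldsymbol{\varepsilon}) = \mathbf{a}'\mathbf{Z}\boldsymbol{\beta}$. Equating the two expected value expressions yields
$\mathbf{a}'\mathbf{Z}\boldsymbol{\beta} = \mathbf{c}'\boldsymbol{\beta}$ or $(\mathbf{c}' - \mathbf{a}'\mathbf{Z})\boldsymbol{\beta} = 0$ for all $\boldsymbol{\beta}$, including the choice $\boldsymbol{\beta} = (\mathbf{c}' - \mathbf{a}'\mathbf{Z})'$.
This implies that $\mathbf{c}' = \mathbf{a}'\mathbf{Z}$ for any unbiased estimator.

Now, $\mathbf{c}'\hat{\boldsymbol{\beta}} = \mathbf{c}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y} = \mathbf{a}^{*\prime}\mathbf{Y}$ with $\mathbf{a}^* = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{c}$. Moreover, from
Result 7.2, $E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$, so $\mathbf{c}'\hat{\boldsymbol{\beta}} = \mathbf{a}^{*\prime}\mathbf{Y}$ is an unbiased estimator of $\mathbf{c}'\boldsymbol{\beta}$. Thus, for any
$\mathbf{a}$ satisfying the unbiased requirement $\mathbf{c}' = \mathbf{a}'\mathbf{Z}$,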


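$$\begin{aligned}
\operatorname{Var}(\mathbf{a}'\mathbf{Y}) &= \operatorname{Var}(\mathbf{a}'\mathbf{Z}\boldsymbol{\beta} + \mathbf{a}'\boldsymbol{\varepsilon}) = \operatorname{Var}(\mathbf{a}'\boldsymbol{\varepsilon}) = \mathbf{a}'\mathbf{I}\sigma^2\mathbf{a} \\
&= \sigma^2(\mathbf{a} - \mathbf{a}^* + \mathbf{a}^*)'(\mathbf{a} - \mathbf{a}^* + \mathbf{a}^*) \\
&= \sigma^2[(\mathbf{a} - \mathbf{a}^*)'(\mathbf{a} - \mathbf{a}^*) + \mathbf{a}^{*\prime}\mathbf{a}^*]
\end{aligned}$$

since $(\mathbf{a} - \mathbf{a}^*)'\mathbf{a}^* = (\mathbf{a} - \mathbf{a}^*)'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{c} = 0$ from the condition $(\mathbf{a} - \mathbf{a}^*)'\mathbf{Z} =
\mathbf{a}'\mathbf{Z} - \mathbf{a}^{*\prime}\mathbf{Z} = \mathbf{c}' - \mathbf{c}' = \mathbf{0}'$. Because $\mathbf{a}^*$ is fixed and $(\mathbf{a} - \mathbf{a}^*)'(\mathbf{a} - \mathbf{a}^*)$ is positive
unless $\mathbf{a} = \mathbf{a}^*$, $\operatorname{Var}(\mathbf{a}'\mathbf{Y})$ is minimized by the choice $\mathbf{a}^{*\prime}\mathbf{Y} = \mathbf{c}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y} = \mathbf{c}'\hat{\boldsymbol{\beta}}$. ■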
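This powerful result states that substitution of $\hat{\boldsymbol{\beta}}$ for $\boldsymbol{\beta}$ leads to the best estimator
of $\mathbf{c}'\boldsymbol{\beta}$ for any $\mathbf{c}$ of interest. In statistical terminology, the estimator $\mathbf{c}'\hat{\boldsymbol{\beta}}$ is called
the best (minimum-variance) linear unbiased estimator (BLUE) of $\mathbf{c}'\boldsymbol{\beta}$.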
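The key step of the proof, that any unbiased linear estimator a'Y has variance at least that of a*'Y, can be illustrated numerically. This is a sketch under assumed inputs: the design matrix, the vector c, and the perturbation w are all made up for illustration. It uses the fact that every a of the form a* + (I - H)w satisfies a'Z = c'.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 10, 2
# Hypothetical full-rank design matrix and coefficient vector c
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
c = np.array([1.0, -2.0, 0.5])

ZtZ_inv = np.linalg.inv(Z.T @ Z)
H = Z @ ZtZ_inv @ Z.T
a_star = Z @ ZtZ_inv @ c                 # a* = Z (Z'Z)^{-1} c

# Another unbiased choice: adding a vector perpendicular to the columns of Z keeps a'Z = c'
w = rng.normal(size=n)
a = a_star + (np.eye(n) - H) @ w

# Var(a'Y)/sigma^2 - Var(a*'Y)/sigma^2 = a'a - a*'a*
excess = a @ a - a_star @ a_star

print(np.allclose(a @ Z, c))                                  # unbiasedness condition holds
print(np.isclose(excess, (a - a_star) @ (a - a_star)))        # excess equals (a - a*)'(a - a*)
```

The excess variance is a squared length, so it is never negative and vanishes only when a = a*.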


7.4 Inferences About the Regression Model 


We describe inferential procedures based on the classical linear regression model in
(7-3) with the additional (tentative) assumption that the errors $\boldsymbol{\varepsilon}$ have a normal distribution.
Methods for checking the general adequacy of the model are considered
in Section 7.6.


Inferences Concerning the Regression Parameters

Before we can assess the importance of particular variables in the regression function

$$E(Y) = \beta_0 + \beta_1 z_1 + \cdots + \beta_r z_r \qquad (7\text{-}10)$$

we must determine the sampling distributions of $\hat{\boldsymbol{\beta}}$ and the residual sum of squares,
$\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$. To do so, we shall assume that the errors $\boldsymbol{\varepsilon}$ have a normal distribution.


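Result 7.4. Let $\mathbf{Y} = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{Z}$ has full rank $r + 1$ and $\boldsymbol{\varepsilon}$ is distributed as
$N_n(\mathbf{0}, \sigma^2\mathbf{I})$. Then the maximum likelihood estimator of $\boldsymbol{\beta}$ is the same as the least
squares estimator $\hat{\boldsymbol{\beta}}$. Moreover,

$$\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y} \text{ is distributed as } N_{r+1}(\boldsymbol{\beta}, \sigma^2(\mathbf{Z}'\mathbf{Z})^{-1})$$

and is distributed independently of the residuals $\hat{\boldsymbol{\varepsilon}} = \mathbf{Y} - \mathbf{Z}\hat{\boldsymbol{\beta}}$. Further,

$$n\hat{\sigma}^2 = \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} \text{ is distributed as } \sigma^2\chi^2_{n-r-1}$$

where $\hat{\sigma}^2$ is the maximum likelihood estimator of $\sigma^2$.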
Proof. (See webpage: www.prenhall.com/statistics) ■


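A confidence ellipsoid for $\boldsymbol{\beta}$ is easily constructed. It is expressed in terms of the
estimated covariance matrix $s^2(\mathbf{Z}'\mathbf{Z})^{-1}$, where $s^2 = \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}/(n - r - 1)$.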


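Result 7.5. Let $\mathbf{Y} = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{Z}$ has full rank $r + 1$ and $\boldsymbol{\varepsilon}$ is $N_n(\mathbf{0}, \sigma^2\mathbf{I})$. Then
a $100(1 - \alpha)$ percent confidence region for $\boldsymbol{\beta}$ is given by

$$(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})'\mathbf{Z}'\mathbf{Z}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) \le (r + 1)s^2 F_{r+1,\,n-r-1}(\alpha)$$

where $F_{r+1,\,n-r-1}(\alpha)$ is the upper $(100\alpha)$th percentile of an F-distribution with $r + 1$
and $n - r - 1$ d.f.

Also, simultaneous $100(1 - \alpha)$ percent confidence intervals for the $\beta_i$ are
given by

$$\hat{\beta}_i \pm \sqrt{\widehat{\operatorname{Var}}(\hat{\beta}_i)}\,\sqrt{(r + 1)F_{r+1,\,n-r-1}(\alpha)}, \qquad i = 0, 1, \ldots, r$$

where $\widehat{\operatorname{Var}}(\hat{\beta}_i)$ is the diagonal element of $s^2(\mathbf{Z}'\mathbf{Z})^{-1}$ corresponding to $\hat{\beta}_i$.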


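Proof. Consider the symmetric square-root matrix $(\mathbf{Z}'\mathbf{Z})^{1/2}$. [See (2-22).] Set
$\mathbf{V} = (\mathbf{Z}'\mathbf{Z})^{1/2}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ and note that $E(\mathbf{V}) = \mathbf{0}$,

$$\operatorname{Cov}(\mathbf{V}) = (\mathbf{Z}'\mathbf{Z})^{1/2}\operatorname{Cov}(\hat{\boldsymbol{\beta}})(\mathbf{Z}'\mathbf{Z})^{1/2} = \sigma^2(\mathbf{Z}'\mathbf{Z})^{1/2}(\mathbf{Z}'\mathbf{Z})^{-1}(\mathbf{Z}'\mathbf{Z})^{1/2} = \sigma^2\mathbf{I}$$

and $\mathbf{V}$ is normally distributed, since it consists of linear combinations of the $\hat{\beta}_i$'s.
Therefore, $\mathbf{V}'\mathbf{V} = (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})'(\mathbf{Z}'\mathbf{Z})^{1/2}(\mathbf{Z}'\mathbf{Z})^{1/2}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})'(\mathbf{Z}'\mathbf{Z})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$
is distributed as $\sigma^2\chi^2_{r+1}$. By Result 7.4, $(n - r - 1)s^2 = \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ is distributed as
$\sigma^2\chi^2_{n-r-1}$ independently of $\hat{\boldsymbol{\beta}}$ and, hence, independently of $\mathbf{V}$. Consequently,
$[\chi^2_{r+1}/(r + 1)]/[\chi^2_{n-r-1}/(n - r - 1)] = [\mathbf{V}'\mathbf{V}/(r + 1)]/s^2$ has an $F_{r+1,\,n-r-1}$ distribution,
and the confidence ellipsoid for $\boldsymbol{\beta}$ follows. Projecting this ellipsoid for
$(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ using Result 5A.1 with $\mathbf{A}^{-1} = \mathbf{Z}'\mathbf{Z}/s^2$, $c^2 = (r + 1)F_{r+1,\,n-r-1}(\alpha)$, and $\mathbf{u}' =
[0, \ldots, 0, 1, 0, \ldots, 0]$ yields $|\beta_i - \hat{\beta}_i| \le \sqrt{(r + 1)F_{r+1,\,n-r-1}(\alpha)}\,\sqrt{\widehat{\operatorname{Var}}(\hat{\beta}_i)}$, where
$\widehat{\operatorname{Var}}(\hat{\beta}_i)$ is the diagonal element of $s^2(\mathbf{Z}'\mathbf{Z})^{-1}$ corresponding to $\hat{\beta}_i$. ■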


The confidence ellipsoid is centered at the maximum likelihood estimate $\hat{\boldsymbol{\beta}}$,
and its orientation and size are determined by the eigenvalues and eigenvectors of
$\mathbf{Z}'\mathbf{Z}$. If an eigenvalue is nearly zero, the confidence ellipsoid will be very long in the
direction of the corresponding eigenvector.
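The link between the eigenvalues of Z'Z and the shape of the ellipsoid can be made concrete. The sketch below assumes a small made-up Z'Z matrix, a made-up value of s², and a user-supplied critical value `f_crit` standing in for F_{r+1,n-r-1}(α); none of these numbers come from the text. The half-length of the ellipsoid along eigenvector e_i is sqrt((r+1) s² F / λ_i).

```python
import numpy as np

def ellipsoid_axes(ZtZ, s2, f_crit):
    """Half-lengths and directions of (b - b_hat)' Z'Z (b - b_hat) <= (r+1) s^2 F."""
    p = ZtZ.shape[0]                    # p = r + 1 parameters
    lam, E = np.linalg.eigh(ZtZ)        # eigenvalues in ascending order, eigenvectors in columns
    c2 = p * s2 * f_crit                # right-hand side of the ellipsoid inequality
    half_lengths = np.sqrt(c2 / lam)    # a small eigenvalue gives a long axis
    return half_lengths, E

# Illustrative inputs only (hypothetical, not from the text)
ZtZ = np.array([[20.0, 3.0],
                [3.0, 0.5]])
s2, f_crit = 1.5, 3.6
half_lengths, E = ellipsoid_axes(ZtZ, s2, f_crit)

for length, direction in zip(half_lengths, E.T):
    print(round(float(length), 3), direction)
```

The first axis printed corresponds to the smallest eigenvalue and is therefore the longest.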


Practitioners often ignore the "simultaneous" confidence property of the interval
estimates in Result 7.5. Instead, they replace $(r + 1)F_{r+1,\,n-r-1}(\alpha)$ with the
one-at-a-time $t$ value $t_{n-r-1}(\alpha/2)$ and use the intervals

$$\hat{\beta}_i \pm t_{n-r-1}\!\left(\frac{\alpha}{2}\right)\sqrt{\widehat{\operatorname{Var}}(\hat{\beta}_i)} \qquad (7\text{-}11)$$

when searching for important predictor variables.


Example 7.4 (Fitting a regression model to real-estate data) The assessment data in
Table 7.1 were gathered from 20 homes in a Milwaukee, Wisconsin, neighborhood.
Fit the regression model

$$Y_j = \beta_0 + \beta_1 z_{j1} + \beta_2 z_{j2} + \varepsilon_j$$

where $z_1$ = total dwelling size (in hundreds of square feet), $z_2$ = assessed value (in
thousands of dollars), and $Y$ = selling price (in thousands of dollars), to these data
using the method of least squares. A computer calculation yields

$$(\mathbf{Z}'\mathbf{Z})^{-1} = \begin{bmatrix} 5.1523 & & \\ .2544 & .0512 & \\ -.1463 & -.0172 & .0067 \end{bmatrix}$$


Table 7.1 Real-Estate Data

z1                     z2                Y
Total dwelling size    Assessed value    Selling price
(100 ft²)              ($1000)           ($1000)
15.31                  57.3              74.8
15.20                  63.8              74.0
16.25                  65.4              72.9
14.33                  57.0              70.0
14.57                  63.8              74.9
17.33                  63.2              76.0
14.48                  60.2              72.0
14.91                  57.7              73.5
15.25                  56.4              74.5
13.89                  55.6              73.5
15.18                  62.6              71.5
14.44                  63.4              71.0
14.87                  60.2              78.9
18.63                  67.2              86.5
15.20                  57.1              68.0
25.76                  89.6              102.0
19.05                  68.6              84.0
15.37                  60.1              69.0
18.06                  66.3              88.0
16.35                  65.8              76.0




and

$$\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} = \begin{bmatrix} 30.967 \\ 2.634 \\ .045 \end{bmatrix}$$

Thus, the fitted equation is

$$\hat{y} = \underset{(7.88)}{30.967} + \underset{(.785)}{2.634}\,z_1 + \underset{(.285)}{.045}\,z_2$$

with $s = 3.473$. The numbers in parentheses are the estimated standard deviations
of the least squares coefficients. Also, $R^2 = .834$, indicating that the data exhibit a
strong regression relationship. (See Panel 7.1, which contains the regression analysis
of these data using the SAS statistical software package.) If the residuals $\hat{\boldsymbol{\varepsilon}}$ pass
the diagnostic checks described in Section 7.6, the fitted equation could be used
to predict the selling price of another house in the neighborhood from its size


PANEL 7.1 SAS ANALYSIS FOR EXAMPLE 7.4 USING PROC REG.

PROGRAM COMMANDS

title 'Regression Analysis';
data estate;
infile 'T7-1.dat';
input z1 z2 y;
proc reg data = estate;
model y = z1 z2;

OUTPUT

Model: MODEL 1
Dependent Variable: Y

Analysis of Variance

                      Sum of         Mean
Source      DF       Squares        Square     F value    Prob > F
Model        2    1032.87506     516.43753      42.828      0.0001
Error       17     204.99494      12.05853
C Total     19    1237.87000

Root MSE     3.47254    R-square    0.8344
Dep Mean    76.55000    Adj R-sq    0.8149
C.V.         4.53630

Parameter Estimates

                  Parameter      Standard      T for H0:
Variable    DF     Estimate         Error    Parameter = 0    Prob > |T|
INTERCEP     1    30.966566    7.88220844            3.929        0.0011
z1           1     2.634400    0.78559872            3.353        0.0038
z2           1     0.045184    0.28518271            0.158        0.8760




and assessed value. We note that a 95% confidence interval for $\beta_2$ [see (7-11)] is
given by

$$\hat{\beta}_2 \pm t_{17}(.025)\sqrt{\widehat{\operatorname{Var}}(\hat{\beta}_2)} = .045 \pm 2.110(.285)$$

or

$$(-.556, .647)$$

Since the confidence interval includes $\beta_2 = 0$, the variable $z_2$ might be dropped
from the regression model and the analysis repeated with the single predictor variable
$z_1$. Given dwelling size, assessed value seems to add little to the prediction of
selling price. ■
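The least squares fit of Example 7.4 can be sketched with NumPy. This is an illustration rather than the authors' computation: the data array is a transcription of Table 7.1 (treat the individual entries as an assumption of this sketch), the t critical value 2.110 is the one quoted in the example, and all variable names are hypothetical. The printed results are meant to be compared with the values reported in the example and in Panel 7.1.

```python
import numpy as np

# Transcription of Table 7.1: z1 (100 ft^2), z2 ($1000), y ($1000)
data = np.array([
    [15.31, 57.3, 74.8], [15.20, 63.8, 74.0], [16.25, 65.4, 72.9],
    [14.33, 57.0, 70.0], [14.57, 63.8, 74.9], [17.33, 63.2, 76.0],
    [14.48, 60.2, 72.0], [14.91, 57.7, 73.5], [15.25, 56.4, 74.5],
    [13.89, 55.6, 73.5], [15.18, 62.6, 71.5], [14.44, 63.4, 71.0],
    [14.87, 60.2, 78.9], [18.63, 67.2, 86.5], [15.20, 57.1, 68.0],
    [25.76, 89.6, 102.0], [19.05, 68.6, 84.0], [15.37, 60.1, 69.0],
    [18.06, 66.3, 88.0], [16.35, 65.8, 76.0],
])
z1, z2, y = data.T
n = len(y)
Z = np.column_stack([np.ones(n), z1, z2])      # design matrix with intercept

# Least squares estimate and residuals
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta_hat

# Residual mean square, standard errors, and R^2
s2 = resid @ resid / (n - Z.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(Z.T @ Z)))
r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

# One-at-a-time 95% interval for beta_2, using t_17(.025) = 2.110 from the example
t_crit = 2.110
ci_beta2 = (beta_hat[2] - t_crit * se[2], beta_hat[2] + t_crit * se[2])

print("beta_hat:", beta_hat)
print("s:", np.sqrt(s2), "R^2:", r2)
print("95% CI for beta_2:", ci_beta2)
```

Using `lstsq` rather than forming the inverse of Z'Z explicitly is a numerical convenience; the inverse is computed here only to obtain the standard errors.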


Likelihood Ratio Tests for the Regression Parameters 


Part of regression analysis is concerned with assessing the effects of particular predictor
variables on the response variable. One null hypothesis of interest states that
certain of the $z_i$'s do not influence the response $Y$. These predictors will be labeled
$z_{q+1}, z_{q+2}, \ldots, z_r$. The statement that $z_{q+1}, z_{q+2}, \ldots, z_r$ do not influence $Y$ translates
into the statistical hypothesis

$$H_0: \beta_{q+1} = \beta_{q+2} = \cdots = \beta_r = 0 \quad \text{or} \quad H_0: \boldsymbol{\beta}_{(2)} = \mathbf{0} \qquad (7\text{-}12)$$

where $\boldsymbol{\beta}_{(2)}' = [\beta_{q+1}, \beta_{q+2}, \ldots, \beta_r]$.
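Setting

$$\mathbf{Z} = \Big[\underset{n \times (q+1)}{\mathbf{Z}_1} \;\Big|\; \underset{n \times (r-q)}{\mathbf{Z}_2}\Big], \qquad \boldsymbol{\beta} = \begin{bmatrix} \underset{((q+1) \times 1)}{\boldsymbol{\beta}_{(1)}} \\[2ex] \underset{((r-q) \times 1)}{\boldsymbol{\beta}_{(2)}} \end{bmatrix}$$

we can express the general linear model as

$$\mathbf{Y} = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon} = [\mathbf{Z}_1 \mid \mathbf{Z}_2]\begin{bmatrix} \boldsymbol{\beta}_{(1)} \\ \boldsymbol{\beta}_{(2)} \end{bmatrix} + \boldsymbol{\varepsilon} = \mathbf{Z}_1\boldsymbol{\beta}_{(1)} + \mathbf{Z}_2\boldsymbol{\beta}_{(2)} + \boldsymbol{\varepsilon}$$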


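Under the null hypothesis $H_0: \boldsymbol{\beta}_{(2)} = \mathbf{0}$, $\mathbf{Y} = \mathbf{Z}_1\boldsymbol{\beta}_{(1)} + \boldsymbol{\varepsilon}$. The likelihood ratio test
of $H_0$ is based on the

$$\begin{aligned}
\text{Extra sum of squares} &= \mathrm{SS}_{\mathrm{res}}(\mathbf{Z}_1) - \mathrm{SS}_{\mathrm{res}}(\mathbf{Z}) \qquad (7\text{-}13) \\
&= (\mathbf{y} - \mathbf{Z}_1\hat{\boldsymbol{\beta}}_{(1)})'(\mathbf{y} - \mathbf{Z}_1\hat{\boldsymbol{\beta}}_{(1)}) - (\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}})
\end{aligned}$$

where $\hat{\boldsymbol{\beta}}_{(1)} = (\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{y}$.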

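Result 7.6. Let $\mathbf{Z}$ have full rank $r + 1$ and $\boldsymbol{\varepsilon}$ be distributed as $N_n(\mathbf{0}, \sigma^2\mathbf{I})$. The
likelihood ratio test of $H_0: \boldsymbol{\beta}_{(2)} = \mathbf{0}$ is equivalent to a test of $H_0$ based on the
extra sum of squares in (7-13) and $s^2 = (\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}})/(n - r - 1)$. In
particular, the likelihood ratio test rejects $H_0$ if

$$\frac{(\mathrm{SS}_{\mathrm{res}}(\mathbf{Z}_1) - \mathrm{SS}_{\mathrm{res}}(\mathbf{Z}))/(r - q)}{s^2} > F_{r-q,\,n-r-1}(\alpha)$$

where $F_{r-q,\,n-r-1}(\alpha)$ is the upper $(100\alpha)$th percentile of an F-distribution with $r - q$
and $n - r - 1$ d.f.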




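Proof. Given the data and the normal assumption, the likelihood associated with
the parameters $\boldsymbol{\beta}$ and $\sigma^2$ is

$$L(\boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi)^{n/2}\sigma^n}\, e^{-(\mathbf{y} - \mathbf{Z}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{Z}\boldsymbol{\beta})/2\sigma^2} \le \frac{1}{(2\pi)^{n/2}\hat{\sigma}^n}\, e^{-n/2}$$

with the maximum occurring at $\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}$ and $\hat{\sigma}^2 = (\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}})/n$.
Under the restriction of the null hypothesis, $\mathbf{Y} = \mathbf{Z}_1\boldsymbol{\beta}_{(1)} + \boldsymbol{\varepsilon}$ and

$$\max_{\boldsymbol{\beta}_{(1)},\,\sigma^2} L(\boldsymbol{\beta}_{(1)}, \sigma^2) = \frac{1}{(2\pi)^{n/2}\hat{\sigma}_1^n}\, e^{-n/2}$$

where the maximum occurs at $\hat{\boldsymbol{\beta}}_{(1)} = (\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{y}$. Moreover,

$$\hat{\sigma}_1^2 = (\mathbf{y} - \mathbf{Z}_1\hat{\boldsymbol{\beta}}_{(1)})'(\mathbf{y} - \mathbf{Z}_1\hat{\boldsymbol{\beta}}_{(1)})/n$$

Rejecting $H_0: \boldsymbol{\beta}_{(2)} = \mathbf{0}$ for small values of the likelihood ratio

$$\frac{\max_{\boldsymbol{\beta}_{(1)},\,\sigma^2} L(\boldsymbol{\beta}_{(1)}, \sigma^2)}{\max_{\boldsymbol{\beta},\,\sigma^2} L(\boldsymbol{\beta}, \sigma^2)} = \left(\frac{\hat{\sigma}_1^2}{\hat{\sigma}^2}\right)^{-n/2} = \left(\frac{\hat{\sigma}^2 + \hat{\sigma}_1^2 - \hat{\sigma}^2}{\hat{\sigma}^2}\right)^{-n/2} = \left(1 + \frac{\hat{\sigma}_1^2 - \hat{\sigma}^2}{\hat{\sigma}^2}\right)^{-n/2}$$

is equivalent to rejecting $H_0$ for large values of $(\hat{\sigma}_1^2 - \hat{\sigma}^2)/\hat{\sigma}^2$ or its scaled version,

$$\frac{n(\hat{\sigma}_1^2 - \hat{\sigma}^2)/(r - q)}{n\hat{\sigma}^2/(n - r - 1)} = \frac{(\mathrm{SS}_{\mathrm{res}}(\mathbf{Z}_1) - \mathrm{SS}_{\mathrm{res}}(\mathbf{Z}))/(r - q)}{s^2} = F$$

The preceding F-ratio has an F-distribution with $r - q$ and $n - r - 1$ d.f. (See [22]
or Result 7.11 with $m = 1$.) ■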


Comment. The likelihood ratio test is implemented as follows. To test whether
all coefficients in a subset are zero, fit the model with and without the terms corresponding
to these coefficients. The improvement in the residual sum of squares (the
extra sum of squares) is compared to the residual sum of squares for the full model
via the F-ratio. The same procedure applies even in analysis of variance situations
where $\mathbf{Z}$ is not of full rank.⁴
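The procedure in the Comment can be sketched as a small function. This is a minimal illustration under assumed inputs: the function names `ss_res` and `extra_ss_f` are hypothetical, the data are simulated, and ranks are used for the degrees of freedom so that the same code also covers a design matrix that is not of full rank.

```python
import numpy as np

def ss_res(Z, y):
    """Residual sum of squares from a least squares fit of y on the columns of Z."""
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)   # also works when Z is rank deficient
    e = y - Z @ b
    return float(e @ e)

def extra_ss_f(Z, Z1, y):
    """F-ratio comparing the full design Z with the reduced design Z1."""
    n = len(y)
    rank_full = np.linalg.matrix_rank(Z)
    rank_reduced = np.linalg.matrix_rank(Z1)
    df_num = rank_full - rank_reduced            # r - q in the full-rank case
    df_den = n - rank_full                       # n - r - 1 in the full-rank case
    s2 = ss_res(Z, y) / df_den
    F = (ss_res(Z1, y) - ss_res(Z, y)) / df_num / s2
    return F, df_num, df_den

# Simulated data in which the last two predictors have zero coefficients
rng = np.random.default_rng(2)
n = 30
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = Z[:, :2] @ np.array([1.0, 2.0]) + rng.normal(size=n)

F, df_num, df_den = extra_ss_f(Z, Z[:, :2], y)
print(F, df_num, df_den)
```

The returned F-ratio would then be compared with the appropriate upper percentile of an F-distribution with `df_num` and `df_den` degrees of freedom.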

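More generally, it is possible to formulate null hypotheses concerning $r - q$ linear
combinations of $\boldsymbol{\beta}$ of the form $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{A}_0$. Let the $(r - q) \times (r + 1)$ matrix
$\mathbf{C}$ have full rank, let $\mathbf{A}_0 = \mathbf{0}$, and consider

$$H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$$

(This null hypothesis reduces to the previous choice when $\mathbf{C} = [\mathbf{0} \mid \mathbf{I}]$, where $\mathbf{I}$ is $(r - q) \times (r - q)$.)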


⁴In situations where $\mathbf{Z}$ is not of full rank, rank($\mathbf{Z}$) replaces $r + 1$ and rank($\mathbf{Z}_1$) replaces $q + 1$ in
Result 7.6.




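Under the full model, $\mathbf{C}\hat{\boldsymbol{\beta}}$ is distributed as $N_{r-q}(\mathbf{C}\boldsymbol{\beta}, \sigma^2\mathbf{C}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{C}')$. We reject
$H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$ at level $\alpha$ if $\mathbf{0}$ does not lie in the $100(1 - \alpha)\%$ confidence ellipsoid for
$\mathbf{C}\boldsymbol{\beta}$. Equivalently, we reject $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{0}$ if

$$\frac{(\mathbf{C}\hat{\boldsymbol{\beta}})'(\mathbf{C}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{C}')^{-1}(\mathbf{C}\hat{\boldsymbol{\beta}})}{s^2} > (r - q)F_{r-q,\,n-r-1}(\alpha) \qquad (7\text{-}14)$$

where $s^2 = (\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}})/(n - r - 1)$ and $F_{r-q,\,n-r-1}(\alpha)$ is the upper
$(100\alpha)$th percentile of an F-distribution with $r - q$ and $n - r - 1$ d.f. The test in
(7-14) is the likelihood ratio test, and the numerator in the F-ratio is the extra residual
sum of squares incurred by fitting the model, subject to the restriction that $\mathbf{C}\boldsymbol{\beta} = \mathbf{0}$.
(See [23].)

The next example illustrates how unbalanced experimental designs are easily
handled by the general theory just described.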


Example 7.5 (Testing the importance of additional predictors using the extra sum-of-squares
approach) Male and female patrons rated the service in three establishments
(locations) of a large restaurant chain. The service ratings were converted
into an index. Table 7.2 contains the data for $n = 18$ customers. Each data point in
the table is categorized according to location (1, 2, or 3) and gender (male = 0 and
female = 1). This categorization has the format of a two-way table with unequal
numbers of observations per cell. For instance, the combination of location 1 and
male has 5 responses, while the combination of location 2 and female has 2 responses.
Introducing three dummy variables to account for location and two dummy variables
to account for gender, we can develop a regression model linking the service
index $Y$ to location, gender, and their "interaction" using the design matrix


Table 7.2 Restaurant-Service Data

Location    Gender    Service (Y)
1           0
1           0
1           0
1           0
1           0
1           1
1           1
2           0
2           0
2           0
2           0
2           0
2           1
2           1
3           0
3           0
3           1
3           1



      constant   location   gender   interaction
        1         1 0 0      1 0     1 0 0 0 0 0   \
        1         1 0 0      1 0     1 0 0 0 0 0   |
        1         1 0 0      1 0     1 0 0 0 0 0   |  5 responses
        1         1 0 0      1 0     1 0 0 0 0 0   |
        1         1 0 0      1 0     1 0 0 0 0 0   /
        1         1 0 0      0 1     0 1 0 0 0 0   \  2 responses
        1         1 0 0      0 1     0 1 0 0 0 0   /
        1         0 1 0      1 0     0 0 1 0 0 0   \
        1         0 1 0      1 0     0 0 1 0 0 0   |
Z =     1         0 1 0      1 0     0 0 1 0 0 0   |  5 responses
        1         0 1 0      1 0     0 0 1 0 0 0   |
        1         0 1 0      1 0     0 0 1 0 0 0   /
        1         0 1 0      0 1     0 0 0 1 0 0   \  2 responses
        1         0 1 0      0 1     0 0 0 1 0 0   /
        1         0 0 1      1 0     0 0 0 0 1 0   \  2 responses
        1         0 0 1      1 0     0 0 0 0 1 0   /
        1         0 0 1      0 1     0 0 0 0 0 1   \  2 responses
        1         0 0 1      0 1     0 0 0 0 0 1   /


The coefficient vector can be set out as

$$\boldsymbol{\beta}' = [\beta_0, \beta_1, \beta_2, \beta_3, \tau_1, \tau_2, \gamma_{11}, \gamma_{12}, \gamma_{21}, \gamma_{22}, \gamma_{31}, \gamma_{32}]$$

where the $\beta_i$'s ($i > 0$) represent the effects of the locations on the determination of
service, the $\tau_i$'s represent the effects of gender on the service index, and the $\gamma_{ik}$'s
represent the location-gender interaction effects.

The design matrix Z is not of full rank. (For instance, column 1 equals the sum 
of columns 2-4 or columns 5-6.) In fact, rank(Z) = 6. 

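For the complete model, results from a computer program give

$$\mathrm{SS}_{\mathrm{res}}(\mathbf{Z}) = 2977.4$$

and $n - \operatorname{rank}(\mathbf{Z}) = 18 - 6 = 12$.

The model without the interaction terms has the design matrix $\mathbf{Z}_1$ consisting of
the first six columns of $\mathbf{Z}$. We find that

$$\mathrm{SS}_{\mathrm{res}}(\mathbf{Z}_1) = 3419.1$$

with $n - \operatorname{rank}(\mathbf{Z}_1) = 18 - 4 = 14$. To test $H_0: \gamma_{11} = \gamma_{12} = \gamma_{21} = \gamma_{22} = \gamma_{31} =
\gamma_{32} = 0$ (no location-gender interaction), we compute

$$F = \frac{(\mathrm{SS}_{\mathrm{res}}(\mathbf{Z}_1) - \mathrm{SS}_{\mathrm{res}}(\mathbf{Z}))/(6 - 4)}{s^2} = \frac{(\mathrm{SS}_{\mathrm{res}}(\mathbf{Z}_1) - \mathrm{SS}_{\mathrm{res}}(\mathbf{Z}))/2}{\mathrm{SS}_{\mathrm{res}}(\mathbf{Z})/12} = \frac{(3419.1 - 2977.4)/2}{2977.4/12} = .89$$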



The F-ratio may be compared with an appropriate percentage point of an
F-distribution with 2 and 12 d.f. This F-ratio is not significant for any reasonable significance
level $\alpha$. Consequently, we conclude that the service index does not depend
upon any location-gender interaction, and these terms can be dropped from the
model.

Using the extra sum-of-squares approach, we may verify that there is no difference
between locations (no location effect), but that gender is significant; that is,
males and females do not give the same ratings to service.

In analysis-of-variance situations where the cell counts are unequal, the variation
in the response attributable to different predictor variables and their interactions
cannot usually be separated into independent amounts. To evaluate the
relative influences of the predictors on the response in this case, it is necessary to fit
the model with and without the terms in question and compute the appropriate
F-test statistics. ■
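The rank and degrees-of-freedom bookkeeping of Example 7.5 can be reproduced without the service ratings themselves. The sketch below builds the 18 x 12 design matrix from the cell counts described in the example, checks the two ranks, and recomputes the F-ratio from the residual sums of squares quoted in the text. The column ordering (constant, three location dummies, two gender dummies, six interaction dummies) follows the example; the variable names are hypothetical.

```python
import numpy as np

# (location, gender, number of responses); gender 0 = male, 1 = female
cells = [(1, 0, 5), (1, 1, 2), (2, 0, 5), (2, 1, 2), (3, 0, 2), (3, 1, 2)]

rows = []
for loc, gen, count in cells:
    location = [1.0 if loc == k else 0.0 for k in (1, 2, 3)]
    gender = [1.0 if gen == g else 0.0 for g in (0, 1)]
    interaction = [1.0 if (loc, gen) == (k, g) else 0.0
                   for k in (1, 2, 3) for g in (0, 1)]
    rows += [[1.0] + location + gender + interaction] * count

Z = np.array(rows)           # full design matrix, 18 x 12
Z1 = Z[:, :6]                # design matrix without the interaction columns

rank_Z = np.linalg.matrix_rank(Z)
rank_Z1 = np.linalg.matrix_rank(Z1)

# Residual sums of squares as quoted in the example
ss_full, ss_reduced = 2977.4, 3419.1
df_num = rank_Z - rank_Z1
df_den = Z.shape[0] - rank_Z
F = (ss_reduced - ss_full) / df_num / (ss_full / df_den)

print(rank_Z, rank_Z1, df_num, df_den, round(F, 2))
```

Because the design matrix is not of full rank, the degrees of freedom are taken from the ranks rather than from the number of columns.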


7.5 Inferences from the Estimated Regression Function 


Once an investigator is satisfied with the fitted regression model, it can be used to
solve two prediction problems. Let $\mathbf{z}_0' = [1, z_{01}, \ldots, z_{0r}]$ be selected values for the
predictor variables. Then $\mathbf{z}_0$ and $\hat{\boldsymbol{\beta}}$ can be used (1) to estimate the regression function
$\beta_0 + \beta_1 z_1 + \cdots + \beta_r z_r$ at $\mathbf{z}_0$ and (2) to estimate the value of the response $Y$
at $\mathbf{z}_0$.


Estimating the Regression Function at $\mathbf{z}_0$


Let $Y_0$ denote the value of the response when the predictor variables have values
$\mathbf{z}_0' = [1, z_{01}, \ldots, z_{0r}]$. According to the model in (7-3), the expected value of $Y_0$ is

$$E(Y_0 \mid \mathbf{z}_0) = \beta_0 + \beta_1 z_{01} + \cdots + \beta_r z_{0r} = \mathbf{z}_0'\boldsymbol{\beta} \qquad (7\text{-}15)$$

Its least squares estimate is $\mathbf{z}_0'\hat{\boldsymbol{\beta}}$.


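Result 7.7. For the linear regression model in (7-3), $\mathbf{z}_0'\hat{\boldsymbol{\beta}}$ is the unbiased linear
estimator of $E(Y_0 \mid \mathbf{z}_0)$ with minimum variance, $\operatorname{Var}(\mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\sigma^2$. If the
errors $\boldsymbol{\varepsilon}$ are normally distributed, then a $100(1 - \alpha)\%$ confidence interval for
$E(Y_0 \mid \mathbf{z}_0) = \mathbf{z}_0'\boldsymbol{\beta}$ is provided by

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} \pm t_{n-r-1}\!\left(\frac{\alpha}{2}\right)\sqrt{(\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)s^2}$$

where $t_{n-r-1}(\alpha/2)$ is the upper $100(\alpha/2)$th percentile of a t-distribution with
$n - r - 1$ d.f.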


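Proof. For a fixed $\mathbf{z}_0$, $\mathbf{z}_0'\hat{\boldsymbol{\beta}}$ is just a linear combination of the $\hat{\beta}_i$'s, so Result
7.3 applies. Also, $\operatorname{Var}(\mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \mathbf{z}_0'\operatorname{Cov}(\hat{\boldsymbol{\beta}})\mathbf{z}_0 = \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\sigma^2$ since $\operatorname{Cov}(\hat{\boldsymbol{\beta}}) =
\sigma^2(\mathbf{Z}'\mathbf{Z})^{-1}$ by Result 7.2. Under the further assumption that $\boldsymbol{\varepsilon}$ is normally distributed,
Result 7.4 asserts that $\hat{\boldsymbol{\beta}}$ is $N_{r+1}(\boldsymbol{\beta}, \sigma^2(\mathbf{Z}'\mathbf{Z})^{-1})$ independently of $s^2/\sigma^2$, which
is distributed as $\chi^2_{n-r-1}/(n - r - 1)$. Consequently, the linear combination $\mathbf{z}_0'\hat{\boldsymbol{\beta}}$ is
$N(\mathbf{z}_0'\boldsymbol{\beta}, \sigma^2\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)$ and

$$\frac{(\mathbf{z}_0'\hat{\boldsymbol{\beta}} - \mathbf{z}_0'\boldsymbol{\beta})\big/\sqrt{\sigma^2\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0}}{\sqrt{s^2/\sigma^2}} = \frac{\mathbf{z}_0'\hat{\boldsymbol{\beta}} - \mathbf{z}_0'\boldsymbol{\beta}}{\sqrt{s^2(\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)}}$$

is distributed as $t_{n-r-1}$. The confidence interval follows. ■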


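Forecasting a New Observation at $\mathbf{z}_0$


Prediction of a new observation, such as $Y_0$, at $\mathbf{z}_0' = [1, z_{01}, \ldots, z_{0r}]$ is more uncertain
than estimating the expected value of $Y_0$. According to the regression model of (7-3),

$$Y_0 = \mathbf{z}_0'\boldsymbol{\beta} + \varepsilon_0$$

or

(new response $Y_0$) = (expected value of $Y_0$ at $\mathbf{z}_0$) + (new error)

where $\varepsilon_0$ is distributed as $N(0, \sigma^2)$ and is independent of $\boldsymbol{\varepsilon}$ and, hence, of $\hat{\boldsymbol{\beta}}$ and $s^2$.
The errors $\boldsymbol{\varepsilon}$ influence the estimators $\hat{\boldsymbol{\beta}}$ and $s^2$ through the responses $\mathbf{Y}$, but $\varepsilon_0$ does not.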


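Result 7.8. Given the linear regression model of (7-3), a new observation $Y_0$ has
the unbiased predictor

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} = \hat{\beta}_0 + \hat{\beta}_1 z_{01} + \cdots + \hat{\beta}_r z_{0r}$$

The variance of the forecast error $Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}$ is

$$\operatorname{Var}(Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \sigma^2(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)$$

When the errors $\boldsymbol{\varepsilon}$ have a normal distribution, a $100(1 - \alpha)\%$ prediction interval for
$Y_0$ is given by

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} \pm t_{n-r-1}\!\left(\frac{\alpha}{2}\right)\sqrt{s^2(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)}$$

where $t_{n-r-1}(\alpha/2)$ is the upper $100(\alpha/2)$th percentile of a t-distribution with
$n - r - 1$ degrees of freedom.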
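Proof. We forecast $Y_0$ by $\mathbf{z}_0'\hat{\boldsymbol{\beta}}$, which estimates $E(Y_0 \mid \mathbf{z}_0)$. By Result 7.7, $\mathbf{z}_0'\hat{\boldsymbol{\beta}}$ has
$E(\mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \mathbf{z}_0'\boldsymbol{\beta}$ and $\operatorname{Var}(\mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\sigma^2$. The forecast error is then
$Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}} = \mathbf{z}_0'\boldsymbol{\beta} + \varepsilon_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}} = \varepsilon_0 + \mathbf{z}_0'(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})$. Thus, $E(Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}) = E(\varepsilon_0) +
E(\mathbf{z}_0'(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})) = 0$, so the predictor is unbiased. Since $\varepsilon_0$ and $\hat{\boldsymbol{\beta}}$ are independent,
$\operatorname{Var}(Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \operatorname{Var}(\varepsilon_0) + \operatorname{Var}(\mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \sigma^2 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\sigma^2 = \sigma^2(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)$.
If it is further assumed that $\boldsymbol{\varepsilon}$ has a normal distribution, then $\hat{\boldsymbol{\beta}}$ is
normally distributed, and so is the linear combination $Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}$. Consequently,
$(Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}})/\sqrt{\sigma^2(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)}$ is distributed as $N(0, 1)$. Dividing this ratio by
$\sqrt{s^2/\sigma^2}$, which is distributed as $\sqrt{\chi^2_{n-r-1}/(n - r - 1)}$, we obtain

$$\frac{Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}}{\sqrt{s^2(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)}}$$

which is distributed as $t_{n-r-1}$. The prediction interval follows immediately. ■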




The prediction interval for $Y_0$ is wider than the confidence interval for estimating
the value of the regression function $E(Y_0 \mid \mathbf{z}_0) = \mathbf{z}_0'\boldsymbol{\beta}$. The additional uncertainty in
forecasting $Y_0$, which is represented by the extra term $s^2$ in the expression
$s^2(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)$, comes from the presence of the unknown error term $\varepsilon_0$.


Example 7.6 (Interval estimates for a mean response and a future response) Companies
considering the purchase of a computer must first assess their future needs in order
to determine the proper equipment. A computer scientist collected data from seven
similar company sites so that a forecast equation of computer-hardware requirements
for inventory management could be developed. The data are given in Table 7.3 for

$z_1$ = customer orders (in thousands)

$z_2$ = add-delete item count (in thousands)

$Y$ = CPU (central processing unit) time (in hours)

Construct a 95% confidence interval for the mean CPU time, $E(Y_0 \mid \mathbf{z}_0) =
\beta_0 + \beta_1 z_{01} + \beta_2 z_{02}$ at $\mathbf{z}_0' = [1, 130, 7.5]$. Also, find a 95% prediction interval for a
new facility's CPU requirement corresponding to the same $\mathbf{z}_0$.

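A computer program provides the estimated regression function

$$\hat{y} = 8.42 + 1.08z_1 + .42z_2$$

$$(\mathbf{Z}'\mathbf{Z})^{-1} = \begin{bmatrix} 8.17969 & & \\ -.06411 & .00052 & \\ .08831 & -.00107 & .01440 \end{bmatrix}$$

and $s = 1.204$. Consequently,

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} = 8.42 + 1.08(130) + .42(7.5) = 151.97$$

and $s\sqrt{\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0} = 1.204(.58928) = .71$. We have $t_4(.025) = 2.776$, so the 95%
confidence interval for the mean CPU time at $\mathbf{z}_0$ is

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} \pm t_4(.025)\,s\sqrt{\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0} = 151.97 \pm 2.776(.71)$$

or (150.00, 153.94).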


Table 7.3 Computer Data

z1          z2                    Y
(Orders)    (Add-delete items)    (CPU time)
123.5       2.108                 141.5
146.1       9.213                 168.9
133.9       1.905                 154.8
128.5        .815                 146.5
151.5       1.061                 172.8
136.2       8.603                 160.1
 92.0       1.125

Source: Data taken from H. P. Artis, Forecasting Computer Requirements: A
Forecaster's Dilemma (Piscataway, NJ: Bell Laboratories, 1979).




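Since $s\sqrt{1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0} = (1.204)(1.16071) = 1.40$, a 95% prediction interval
for the CPU time at a new facility with conditions $\mathbf{z}_0$ is

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} \pm t_4(.025)\,s\sqrt{1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0} = 151.97 \pm 2.776(1.40)$$

or (148.08, 155.86). ■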
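The two intervals of Example 7.6 can be recomputed from the quantities printed in the example. This sketch takes the rounded coefficient estimates, the rounded (Z'Z)^{-1}, s = 1.204, and t_4(.025) = 2.776 as given; the variable names are hypothetical. Because the inputs are rounded, the endpoints may differ from the printed ones in the second decimal place.

```python
import numpy as np

# Quantities as printed in Example 7.6 (rounded values)
beta_hat = np.array([8.42, 1.08, 0.42])
ZtZ_inv = np.array([[ 8.17969, -0.06411,  0.08831],
                    [-0.06411,  0.00052, -0.00107],
                    [ 0.08831, -0.00107,  0.01440]])
s, t_crit = 1.204, 2.776                 # residual standard deviation and t_4(.025)
z0 = np.array([1.0, 130.0, 7.5])

fit = z0 @ beta_hat                      # estimated mean response at z0
q = z0 @ ZtZ_inv @ z0                    # z0'(Z'Z)^{-1} z0

# Confidence interval for the mean response
half_ci = t_crit * s * np.sqrt(q)
ci = (fit - half_ci, fit + half_ci)

# Prediction interval for a new observation: note the extra "1 +" under the root
half_pi = t_crit * s * np.sqrt(1 + q)
pi = (fit - half_pi, fit + half_pi)

print("fit:", fit)
print("confidence interval:", ci)
print("prediction interval:", pi)
```

The only difference between the two computations is the additional 1 inside the square root, which accounts for the new error term.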


7.6 Model Checking and Other Aspects of Regression 
Does the Model Fit? 


Assuming that the model is “correct,” we have used the estimated regression 
function to make inferences. Of course, it is imperative to examine the adequacy of 
the model before the estimated function becomes a permanent part of the decision- 
making apparatus. 

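All the sample information on lack of fit is contained in the residuals

$$\begin{aligned}
\hat{\varepsilon}_1 &= y_1 - \hat{\beta}_0 - \hat{\beta}_1 z_{11} - \cdots - \hat{\beta}_r z_{1r} \\
\hat{\varepsilon}_2 &= y_2 - \hat{\beta}_0 - \hat{\beta}_1 z_{21} - \cdots - \hat{\beta}_r z_{2r} \\
&\;\;\vdots \\
\hat{\varepsilon}_n &= y_n - \hat{\beta}_0 - \hat{\beta}_1 z_{n1} - \cdots - \hat{\beta}_r z_{nr}
\end{aligned}$$

or

$$\hat{\boldsymbol{\varepsilon}} = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{y} = [\mathbf{I} - \mathbf{H}]\mathbf{y} \qquad (7\text{-}16)$$

If the model is valid, each residual $\hat{\varepsilon}_j$ is an estimate of the error $\varepsilon_j$, which is assumed to
be a normal random variable with mean zero and variance $\sigma^2$. Although the residuals
$\hat{\boldsymbol{\varepsilon}}$ have expected value $\mathbf{0}$, their covariance matrix $\sigma^2[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'] = \sigma^2[\mathbf{I} - \mathbf{H}]$
is not diagonal. Residuals have unequal variances and nonzero correlations. Fortunately,
the correlations are often small and the variances are nearly equal.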

Because the residuals $\hat{\boldsymbol{\varepsilon}}$ have covariance matrix $\sigma^2[\mathbf{I} - \mathbf{H}]$, the variances of the
$\hat{\varepsilon}_j$ can vary greatly if the diagonal elements of $\mathbf{H}$, the leverages $h_{jj}$, are substantially
different. Consequently, many statisticians prefer graphical diagnostics based on studentized
residuals. Using the residual mean square $s^2$ as an estimate of $\sigma^2$, we have

$$\widehat{\operatorname{Var}}(\hat{\varepsilon}_j) = s^2(1 - h_{jj}), \qquad j = 1, 2, \ldots, n \qquad (7\text{-}17)$$

and the studentized residuals are

$$\hat{\varepsilon}_j^* = \frac{\hat{\varepsilon}_j}{\sqrt{s^2(1 - h_{jj})}}, \qquad j = 1, 2, \ldots, n \qquad (7\text{-}18)$$

We expect the studentized residuals to look, approximately, like independent drawings
from an $N(0, 1)$ distribution. Some software packages go one step further and
studentize $\hat{\varepsilon}_j$ using the delete-one estimated variance $s^2(j)$, which is the residual
mean square when the $j$th observation is dropped from the analysis.
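A minimal sketch of the studentized residuals in (7-18), assuming a full-rank design matrix and simulated data (the function name `studentized_residuals`, the seed, and the coefficients are illustrative, not from the text):

```python
import numpy as np

def studentized_residuals(Z, y):
    """Return studentized residuals, leverages h_jj, and s^2 for a full-rank Z."""
    n, p = Z.shape                              # p = r + 1
    H = Z @ np.linalg.solve(Z.T @ Z, Z.T)       # hat matrix Z (Z'Z)^{-1} Z'
    h = np.diag(H)                              # leverages
    resid = y - H @ y                           # residuals [I - H] y
    s2 = resid @ resid / (n - p)                # residual mean square
    return resid / np.sqrt(s2 * (1 - h)), h, s2

# Simulated illustration
rng = np.random.default_rng(3)
n = 12
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = Z @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

e_star, h, s2 = studentized_residuals(Z, y)
print(np.round(e_star, 2))
print("sum of leverages:", h.sum())   # equals r + 1 up to rounding error
```

The delete-one version mentioned above would replace s² by s²(j) for each observation; that refinement is not implemented in this sketch.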




Residuals should be plotted in various ways to detect possible anomalies. For
general diagnostic purposes, the following are useful graphs:

1. Plot the residuals $\hat{\varepsilon}_j$ against the predicted values $\hat{y}_j = \hat{\beta}_0 + \hat{\beta}_1 z_{j1} + \cdots + \hat{\beta}_r z_{jr}$.
Departures from the assumptions of the model are typically indicated by two
types of phenomena:

(a) A dependence of the residuals on the predicted value. This is illustrated in
Figure 7.2(a). The numerical calculations are incorrect, or a $\beta_0$ term has
been omitted from the model.

(b) The variance is not constant. The pattern of residuals may be funnel
shaped, as in Figure 7.2(b), so that there is large variability for large $\hat{y}$ and
small variability for small $\hat{y}$. If this is the case, the variance of the error is
not constant, and transformations or a weighted least squares approach (or
both) are required. (See Exercise 7.3.) In Figure 7.2(d), the residuals form a
horizontal band. This is ideal and indicates equal variances and no dependence
on $\hat{y}$.

2. Plot the residuals $\hat{\varepsilon}_j$ against a predictor variable, such as $z_1$, or products of predictor
variables, such as $z_1^2$ or $z_1 z_2$. A systematic pattern in these plots suggests the
need for more terms in the model. This situation is illustrated in Figure 7.2(c).

3. Q-Q plots and histograms. Do the errors appear to be normally distributed? To
answer this question, the residuals $\hat{\varepsilon}_j$ or $\hat{\varepsilon}_j^*$ can be examined using the techniques
discussed in Section 4.6. The Q-Q plots, histograms, and dot diagrams help to
detect the presence of unusual observations or severe departures from normality
that may require special attention in the analysis. If $n$ is large, minor departures
from normality will not greatly affect inferences about $\boldsymbol{\beta}$.

Figure 7.2 Residual plots, panels (a) through (d).




4. Plot the residuals versus time. The assumption of independence is crucial, but
hard to check. If the data are naturally chronological, a plot of the residuals versus
time may reveal a systematic pattern. (A plot of the positions of the residuals
in space may also reveal associations among the errors.) For instance,
residuals that increase over time indicate a strong positive dependence. A statistical
test of independence can be constructed from the first autocorrelation,

$$r_1 = \frac{\sum_{j=2}^{n} \hat{\varepsilon}_j \hat{\varepsilon}_{j-1}}{\sum_{j=1}^{n} \hat{\varepsilon}_j^2} \qquad (7\text{-}19)$$

of residuals from adjacent periods. A popular test based on the statistic
$\sum_{j=2}^{n} (\hat{\varepsilon}_j - \hat{\varepsilon}_{j-1})^2 \big/ \sum_{j=1}^{n} \hat{\varepsilon}_j^2 \approx 2(1 - r_1)$ is called the Durbin-Watson test. (See [14]
for a description of this test and tables of critical values.)
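The first autocorrelation (7-19) and the Durbin-Watson statistic are simple to compute from a residual vector. The sketch below uses simulated residuals (an assumption for illustration; function and variable names are hypothetical) and also shows why the relation to 2(1 - r_1) is only approximate: the two differ by an end-point term involving the first and last residuals.

```python
import numpy as np

def first_autocorrelation(e):
    """r_1: sum of products of adjacent residuals over the sum of squared residuals."""
    return float(np.sum(e[1:] * e[:-1]) / np.sum(e ** 2))

def durbin_watson(e):
    """Sum of squared successive differences over the sum of squared residuals."""
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Simulated residuals in time order, centered so they sum to zero
rng = np.random.default_rng(4)
e = rng.normal(size=50)
e = e - e.mean()

r1 = first_autocorrelation(e)
dw = durbin_watson(e)

# Exact relation: dw = 2(1 - r1) - (e_1^2 + e_n^2) / sum(e_j^2)
end_term = (e[0] ** 2 + e[-1] ** 2) / np.sum(e ** 2)
print(r1, dw, 2 * (1 - r1) - end_term)
```

Values of the statistic near 2 correspond to r_1 near zero; critical values still have to be taken from tables such as those referenced in the text.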


Example 7.7 (Residual plots) Three residual plots for the computer data discussed 
in Example 7.6 are shown in Figure 7.3. The sample size n = 7 is really too small to 
allow definitive judgments; however, it appears as if the regression assumptions are 
tenable. ■

Figure 7.3 Residual plots for the computer data of Example 7.6.




If several observations of the response are available for the same values of the
predictor variables, then a formal test for lack of fit can be carried out. (See [13] for
a discussion of the pure-error lack-of-fit test.)


Leverage and Influence 


Although a residual analysis is useful in assessing the fit of a model, departures from
the regression model are often hidden by the fitting process. For example, there may
be "outliers" in either the response or explanatory variables that can have a considerable
effect on the analysis yet are not easily detected from an examination of
residual plots. In fact, these outliers may determine the fit.
The leverage $h_{jj}$, the $(j, j)$ diagonal element of $\mathbf{H} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$, can be interpreted
in two related ways. First, the leverage is associated with the $j$th data point and measures,
in the space of the explanatory variables, how far the $j$th observation is from the
other $n - 1$ observations. For simple linear regression with one explanatory variable $z$,

$$h_{jj} = \frac{1}{n} + \frac{(z_j - \bar{z})^2}{\sum_{j=1}^{n} (z_j - \bar{z})^2}$$

The average leverage is $(r + 1)/n$. (See Exercise 7.8.)
Second, the leverage $h_{jj}$ is a measure of pull that a single case exerts on the fit.
The vector of predicted values is

$$\hat{\mathbf{y}} = \mathbf{Z}\hat{\boldsymbol{\beta}} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} = \mathbf{H}\mathbf{y}$$

where the $j$th row expresses the fitted value $\hat{y}_j$ in terms of the observations as

$$\hat{y}_j = h_{jj}y_j + \sum_{k \ne j} h_{jk}y_k$$

Provided that all other $y$ values are held fixed,

(change in $\hat{y}_j$) = $h_{jj}$ (change in $y_j$)

If the leverage is large relative to the other $h_{jk}$, then $y_j$ will be a major contributor to
the predicted value $\hat{y}_j$.
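Both interpretations of leverage can be checked on a toy data set. The sketch below assumes five made-up (z, y) pairs for simple linear regression, with the last z value deliberately far from the others; it compares the diagonal of H with the closed-form expression and confirms that changing one response by one unit changes its own fitted value by h_jj.

```python
import numpy as np

# Illustrative data (not from the text); the last z is far from the rest
z = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([2.0, 2.9, 4.1, 5.0, 11.2])
n = len(z)

Z = np.column_stack([np.ones(n), z])
H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
h = np.diag(H)

# Closed form for simple linear regression: 1/n + (z_j - zbar)^2 / sum (z_j - zbar)^2
h_formula = 1 / n + (z - z.mean()) ** 2 / np.sum((z - z.mean()) ** 2)

# Change only the fifth response by one unit and see how its fitted value moves
y_shift = y.copy()
y_shift[4] += 1.0
change_in_fit = (H @ y_shift - H @ y)[4]

print(np.round(h, 3))
print("average leverage:", h.mean(), "expected:", 2 / n)
print("change in fitted value:", change_in_fit, "leverage:", h[4])
```

The isolated point carries most of the leverage, so its own response largely determines its fitted value.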

Observations that significantly affect inferences drawn from the data are said to
be influential. Methods for assessing influence are typically based on the change in
the vector of parameter estimates, $\hat{\boldsymbol{\beta}}$, when observations are deleted. Plots based
upon leverage and influence statistics and their use in diagnostic checking of regression
models are described in [3], [5], and [10]. These references are recommended
for anyone involved in an analysis of regression models.

If, after the diagnostic checks, no serious violations of the assumptions are detected,
we can make inferences about $\boldsymbol{\beta}$ and the future $Y$ values with some assurance
that we will not be misled.


Additional Problems in Linear Regression 


We shall briefly discuss several important aspects of regression that deserve and receive
extensive treatments in texts devoted to regression analysis. (See [10], [11], [13], and [23].)




Selecting predictor variables from a large set. In practice, it is often difficult to formulate
an appropriate regression function immediately. Which predictor variables
should be included? What form should the regression function take?

When the list of possible predictor variables is very large, not all of the variables
can be included in the regression function. Techniques and computer programs designed
to select the "best" subset of predictors are now readily available. The good
ones try all subsets: $z_1$ alone, $z_2$ alone, ..., $z_1$ and $z_2$, .... The best choice is decided by
examining some criterion quantity like $R^2$. [See (7-9).] However, $R^2$ always increases
with the inclusion of additional predictor variables. Although this problem can be
circumvented by using the adjusted $R^2$, $\bar{R}^2 = 1 - (1 - R^2)(n - 1)/(n - r - 1)$, a
better statistic for selecting variables seems to be Mallow's $C_p$ statistic (see [12]),

$$C_p = \left(\frac{\text{residual sum of squares for subset model with } p \text{ parameters, including an intercept}}{\text{residual variance for full model}}\right) - (n - 2p)$$

A plot of the pairs $(p, C_p)$, one for each subset of predictors, will indicate models
that forecast the observed responses well. Good models typically have $(p, C_p)$ coordinates
near the 45° line. In Figure 7.4, we have circled the point corresponding to
the "best" subset of predictor variables.

If the list of predictor variables is very long, cost considerations limit the number 
of models that can be examined. Another approach, called stepwise regression (see 
[13]), attempts to select important predictors without considering all the possibilities. 


Figure 7.4 $C_p$ plot for computer data from Example 7.6 with three predictor variables
($z_1$ = orders, $z_2$ = add-delete count, $z_3$ = number of items; see the example and original source).
Numbers in parentheses correspond to predictor variables.




The procedure can be described by listing the basic steps (algorithm) involved in the
computations:

Step 1. All possible simple linear regressions are considered. The predictor variable
that explains the largest significant proportion of the variation in $Y$ (the variable
that has the largest correlation with the response) is the first variable to enter the
regression function.

Step 2. The next variable to enter is the one (out of those not yet included) that
makes the largest significant contribution to the regression sum of squares. The
significance of the contribution is determined by an $F$-test. (See Result 7.6.) The value
of the $F$-statistic that must be exceeded before the contribution of a variable is
deemed significant is often called the $F$ to enter.

Step 3. Once an additional variable has been included in the equation, the individual
contributions to the regression sum of squares of the other variables already in
the equation are checked for significance using $F$-tests. If the $F$-statistic is less than
the one (called the $F$ to remove) corresponding to a prescribed significance level, the
variable is deleted from the regression function.

Step 4. Steps 2 and 3 are repeated until all possible additions are nonsignificant and
all possible deletions are significant. At this point the selection stops.
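The four steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses NumPy, the function names are hypothetical, and the $F$ to enter and $F$ to remove are fixed numbers (4.0 and 3.9) rather than percentiles looked up for a prescribed significance level.

```python
import numpy as np

def rss(Z, y):
    """Residual sum of squares from a least squares fit of y on the columns of Z."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)

def design(X, cols):
    """Intercept column followed by the selected predictor columns."""
    return np.hstack([np.ones((X.shape[0], 1)), X[:, list(cols)]])

def partial_f(X, y, cols, j):
    """F-statistic for the contribution of predictor j given the others in cols."""
    others = [c for c in cols if c != j]
    rss_full = rss(design(X, cols), y)
    rss_reduced = rss(design(X, others), y)
    df_error = X.shape[0] - len(cols) - 1
    return (rss_reduced - rss_full) / (rss_full / df_error)

def stepwise(X, y, f_enter=4.0, f_remove=3.9, max_iter=100):
    selected = []
    for _ in range(max_iter):
        changed = False
        # Steps 1 and 2: enter the candidate with the largest significant F.
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if candidates:
            f_vals = {j: partial_f(X, y, selected + [j], j) for j in candidates}
            best = max(f_vals, key=f_vals.get)
            if f_vals[best] > f_enter:
                selected.append(best)
                changed = True
        # Step 3: delete variables whose contribution is no longer significant.
        for j in list(selected):
            if partial_f(X, y, selected, j) < f_remove:
                selected.remove(j)
                changed = True
        # Step 4: stop when nothing enters and nothing leaves.
        if not changed:
            break
    return sorted(selected)

# Simulated (made-up) data: only predictors 0 and 2 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=60)
selected = stepwise(X, y)
print(selected)
```

Keeping the $F$ to remove no larger than the $F$ to enter is what prevents a variable from cycling in and out; `max_iter` is only a safeguard.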


Because of the step-by-step procedure, there is no guarantee that this approach
will select, for example, the best three variables for prediction. A second drawback is
that the (automatic) selection methods are not capable of indicating when
transformations of variables are useful.

Another popular criterion for selecting an appropriate model, called an information
criterion, also balances the size of the residual sum of squares with the number
of parameters in the model.

Akaike's information criterion (AIC) is

$$\text{AIC} = n \ln\left(\frac{\text{residual sum of squares for subset model with } p \text{ parameters, including an intercept}}{n}\right) + 2p$$


It is desirable that residual sum of squares be small, but the second term penal- 
izes for too many parameters. Overall, we want to select models from those having 
the smaller values of AIC. 
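A small sketch of the comparison, using only the Python standard library. The residual sums of squares and model labels below are hypothetical numbers chosen for illustration; they do not come from the text.

```python
import math

def aic(rss, n, p):
    """AIC = n ln(RSS / n) + 2p for a subset model with p parameters (intercept included)."""
    return n * math.log(rss / n) + 2 * p

# Hypothetical residual sums of squares for three nested models fit to n = 50 cases.
n = 50
candidates = {"z1": (120.0, 2), "z1, z2": (80.0, 3), "z1, z2, z3": (79.5, 4)}
scores = {name: aic(rss, n, p) for name, (rss, p) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Adding $z_3$ lowers the residual sum of squares only slightly, so the $2p$ penalty outweighs the gain and the two-predictor model has the smallest AIC.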


Colinearity. If $\mathbf{Z}$ is not of full rank, some linear combination, such as $\mathbf{Z}\mathbf{a}$, must equal
$\mathbf{0}$. In this situation, the columns are said to be colinear. This implies that $\mathbf{Z}'\mathbf{Z}$ does
not have an inverse. For most regression analyses, it is unlikely that $\mathbf{Z}\mathbf{a} = \mathbf{0}$ exactly.
Yet, if linear combinations of the columns of $\mathbf{Z}$ exist that are nearly $\mathbf{0}$, the calculation
of $(\mathbf{Z}'\mathbf{Z})^{-1}$ is numerically unstable. Typically, the diagonal entries of $(\mathbf{Z}'\mathbf{Z})^{-1}$ will
be large. This yields large estimated variances for the $\hat{\beta}_i$'s and it is then difficult
to detect the “significant” regression coefficients $\hat{\beta}_i$. The problems caused by
colinearity can be overcome somewhat by (1) deleting one of a pair of predictor variables
that are strongly correlated or (2) relating the response $Y$ to the principal
components of the predictor variables; that is, the rows $\mathbf{z}_j'$ of $\mathbf{Z}$ are treated as a sample, and
the first few principal components are calculated as is subsequently described in
Section 8.3. The response $Y$ is then regressed on these new predictor variables.
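The effect of near colinearity on $(\mathbf{Z}'\mathbf{Z})^{-1}$ is easy to see numerically. The following is a minimal sketch assuming NumPy; the two small design matrices are made up for illustration, and `diag_inverse` is a hypothetical helper name.

```python
import numpy as np

def diag_inverse(Z):
    """Diagonal entries of (Z'Z)^{-1}; times s^2 they give the estimated Var(beta_hat_i)."""
    return np.diag(np.linalg.inv(Z.T @ Z))

z1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
z2_unrelated = np.array([1.0, -1.0, 2.0, -2.0, 0.5, -0.5])
# Second predictor that is almost exactly 2 * z1 (nearly colinear with z1).
z2_nearly_colinear = 2.0 * z1 + np.array([0.01, -0.01, 0.02, -0.02, 0.01, -0.01])

ones = np.ones_like(z1)
Z_good = np.column_stack([ones, z1, z2_unrelated])
Z_bad = np.column_stack([ones, z1, z2_nearly_colinear])

print(diag_inverse(Z_good))
print(diag_inverse(Z_bad))
print(np.linalg.cond(Z_good), np.linalg.cond(Z_bad))
```

The diagonal entries for the nearly colinear design are larger by several orders of magnitude, which is exactly what inflates the estimated variances of the coefficients.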




Bias caused by a misspecified model. Suppose some important predictor variables 
are omitted from the proposed regression model. That is, suppose the true model 
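has $\mathbf{Z} = [\mathbf{Z}_1 \mid \mathbf{Z}_2]$ with rank $r + 1$ and

$$\underset{(n\times 1)}{\mathbf{Y}} = \Big[\underset{(n\times(q+1))}{\mathbf{Z}_1} \;\Big|\; \underset{(n\times(r-q))}{\mathbf{Z}_2}\Big] \begin{bmatrix} \underset{((q+1)\times 1)}{\boldsymbol{\beta}_{(1)}} \\ \underset{((r-q)\times 1)}{\boldsymbol{\beta}_{(2)}} \end{bmatrix} + \underset{(n\times 1)}{\boldsymbol{\varepsilon}} = \mathbf{Z}_1\boldsymbol{\beta}_{(1)} + \mathbf{Z}_2\boldsymbol{\beta}_{(2)} + \boldsymbol{\varepsilon} \tag{7-20}$$

where $E(\boldsymbol{\varepsilon}) = \mathbf{0}$ and $\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$. However, the investigator unknowingly fits
a model using only the first $q$ predictors by minimizing the error sum of
squares $(\mathbf{Y} - \mathbf{Z}_1\boldsymbol{\beta}_{(1)})'(\mathbf{Y} - \mathbf{Z}_1\boldsymbol{\beta}_{(1)})$. The least squares estimator of $\boldsymbol{\beta}_{(1)}$ is $\hat{\boldsymbol{\beta}}_{(1)} = (\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{Y}$. Then, unlike the situation when the model is correct,

$$\begin{aligned}
E(\hat{\boldsymbol{\beta}}_{(1)}) &= (\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'E(\mathbf{Y}) = (\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'(\mathbf{Z}_1\boldsymbol{\beta}_{(1)} + \mathbf{Z}_2\boldsymbol{\beta}_{(2)} + E(\boldsymbol{\varepsilon})) \\
&= \boldsymbol{\beta}_{(1)} + (\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{Z}_2\boldsymbol{\beta}_{(2)}
\end{aligned} \tag{7-21}$$

That is, $\hat{\boldsymbol{\beta}}_{(1)}$ is a biased estimator of $\boldsymbol{\beta}_{(1)}$ unless the columns of $\mathbf{Z}_1$ are perpendicular
to those of $\mathbf{Z}_2$ (that is, $\mathbf{Z}_1'\mathbf{Z}_2 = \mathbf{0}$). If important variables are missing from the
model, the least squares estimates $\hat{\boldsymbol{\beta}}_{(1)}$ may be misleading.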
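The bias term in (7-21) can be checked directly on a small case. The sketch below assumes NumPy; the design and the coefficient values are hypothetical, chosen so that the omitted predictor ($z_1^2$) is correlated with the retained one.

```python
import numpy as np

# Hypothetical true model: Y = Z1 beta1 + Z2 beta2 + error, with Z2 omitted from the fit.
z1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Z1 = np.column_stack([np.ones(5), z1])        # intercept and one predictor (q = 1)
Z2 = (z1 ** 2).reshape(-1, 1)                 # omitted predictor, correlated with z1
beta1 = np.array([1.0, 2.0])
beta2 = np.array([0.5])

# The least squares estimator is linear in Y, so fitting E(Y) on Z1 alone
# returns E(beta1_hat) exactly.
mean_Y = Z1 @ beta1 + Z2 @ beta2
expected_beta1_hat = np.linalg.solve(Z1.T @ Z1, Z1.T @ mean_Y)

# Bias term from (7-21): (Z1'Z1)^{-1} Z1'Z2 beta2
bias = np.linalg.solve(Z1.T @ Z1, Z1.T @ Z2 @ beta2)
print(expected_beta1_hat, beta1 + bias)
```

Here the bias is $[-1, 2]'$, so the expected fitted intercept and slope are $0$ and $4$ rather than the true $1$ and $2$.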


7.7 Multivariate Multiple Regression 


In this section, we consider the problem of modeling the relationship between 
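$m$ responses $Y_1, Y_2, \ldots, Y_m$ and a single set of predictor variables $z_1, z_2, \ldots, z_r$. Each
response is assumed to follow its own regression model, so that

$$\begin{aligned}
Y_1 &= \beta_{01} + \beta_{11}z_1 + \cdots + \beta_{r1}z_r + \varepsilon_1 \\
Y_2 &= \beta_{02} + \beta_{12}z_1 + \cdots + \beta_{r2}z_r + \varepsilon_2 \\
&\;\;\vdots \\
Y_m &= \beta_{0m} + \beta_{1m}z_1 + \cdots + \beta_{rm}z_r + \varepsilon_m
\end{aligned} \tag{7-22}$$

The error term $\boldsymbol{\varepsilon}' = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m]$ has $E(\boldsymbol{\varepsilon}) = \mathbf{0}$ and $\text{Var}(\boldsymbol{\varepsilon}) = \boldsymbol{\Sigma}$. Thus, the error
terms associated with different responses may be correlated.

To establish notation conforming to the classical linear regression model, let
$[z_{j0}, z_{j1}, \ldots, z_{jr}]$ denote the values of the predictor variables for the $j$th trial,
let $\mathbf{Y}_j' = [Y_{j1}, Y_{j2}, \ldots, Y_{jm}]$ be the responses, and let $\boldsymbol{\varepsilon}_j' = [\varepsilon_{j1}, \varepsilon_{j2}, \ldots, \varepsilon_{jm}]$ be the
errors. In matrix notation, the design matrix

$$\underset{(n\times(r+1))}{\mathbf{Z}} = \begin{bmatrix} z_{10} & z_{11} & \cdots & z_{1r} \\ z_{20} & z_{21} & \cdots & z_{2r} \\ \vdots & \vdots & & \vdots \\ z_{n0} & z_{n1} & \cdots & z_{nr} \end{bmatrix}$$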




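is the same as that for the single-response regression model. [See (7-3).] The other
matrix quantities have multivariate counterparts. Set

$$\underset{(n\times m)}{\mathbf{Y}} = \begin{bmatrix} Y_{11} & Y_{12} & \cdots & Y_{1m} \\ Y_{21} & Y_{22} & \cdots & Y_{2m} \\ \vdots & \vdots & & \vdots \\ Y_{n1} & Y_{n2} & \cdots & Y_{nm} \end{bmatrix} = [\mathbf{Y}_{(1)} \mid \mathbf{Y}_{(2)} \mid \cdots \mid \mathbf{Y}_{(m)}]$$

$$\underset{((r+1)\times m)}{\boldsymbol{\beta}} = \begin{bmatrix} \beta_{01} & \beta_{02} & \cdots & \beta_{0m} \\ \beta_{11} & \beta_{12} & \cdots & \beta_{1m} \\ \vdots & \vdots & & \vdots \\ \beta_{r1} & \beta_{r2} & \cdots & \beta_{rm} \end{bmatrix} = [\boldsymbol{\beta}_{(1)} \mid \boldsymbol{\beta}_{(2)} \mid \cdots \mid \boldsymbol{\beta}_{(m)}]$$

$$\underset{(n\times m)}{\boldsymbol{\varepsilon}} = \begin{bmatrix} \varepsilon_{11} & \varepsilon_{12} & \cdots & \varepsilon_{1m} \\ \varepsilon_{21} & \varepsilon_{22} & \cdots & \varepsilon_{2m} \\ \vdots & \vdots & & \vdots \\ \varepsilon_{n1} & \varepsilon_{n2} & \cdots & \varepsilon_{nm} \end{bmatrix} = [\boldsymbol{\varepsilon}_{(1)} \mid \boldsymbol{\varepsilon}_{(2)} \mid \cdots \mid \boldsymbol{\varepsilon}_{(m)}] = \begin{bmatrix} \boldsymbol{\varepsilon}_1' \\ \boldsymbol{\varepsilon}_2' \\ \vdots \\ \boldsymbol{\varepsilon}_n' \end{bmatrix}$$

The multivariate linear regression model is

$$\underset{(n\times m)}{\mathbf{Y}} = \underset{(n\times(r+1))}{\mathbf{Z}}\;\underset{((r+1)\times m)}{\boldsymbol{\beta}} + \underset{(n\times m)}{\boldsymbol{\varepsilon}} \tag{7-23}$$

with

$$E(\boldsymbol{\varepsilon}_{(i)}) = \mathbf{0} \quad\text{and}\quad \text{Cov}(\boldsymbol{\varepsilon}_{(i)}, \boldsymbol{\varepsilon}_{(k)}) = \sigma_{ik}\mathbf{I}, \qquad i, k = 1, 2, \ldots, m$$

The $m$ observations on the $j$th trial have covariance matrix $\boldsymbol{\Sigma} = \{\sigma_{ik}\}$, but
observations from different trials are uncorrelated. Here $\boldsymbol{\beta}$ and $\sigma_{ik}$ are unknown
parameters; the design matrix $\mathbf{Z}$ has $j$th row $[z_{j0}, z_{j1}, \ldots, z_{jr}]$.

Simply stated, the $i$th response $\mathbf{Y}_{(i)}$ follows the linear regression model

$$\mathbf{Y}_{(i)} = \mathbf{Z}\boldsymbol{\beta}_{(i)} + \boldsymbol{\varepsilon}_{(i)}, \qquad i = 1, 2, \ldots, m \tag{7-24}$$

with $\text{Cov}(\boldsymbol{\varepsilon}_{(i)}) = \sigma_{ii}\mathbf{I}$. However, the errors for different responses on the same trial
can be correlated.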

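Given the outcomes $\mathbf{Y}$ and the values of the predictor variables $\mathbf{Z}$ with full
column rank, we determine the least squares estimates $\hat{\boldsymbol{\beta}}_{(i)}$ exclusively from the
observations $\mathbf{Y}_{(i)}$ on the $i$th response. In conformity with the single-response
solution, we take

$$\hat{\boldsymbol{\beta}}_{(i)} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y}_{(i)} \tag{7-25}$$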




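Collecting these univariate least squares estimates, we obtain

$$\hat{\boldsymbol{\beta}} = [\hat{\boldsymbol{\beta}}_{(1)} \mid \hat{\boldsymbol{\beta}}_{(2)} \mid \cdots \mid \hat{\boldsymbol{\beta}}_{(m)}] = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'[\mathbf{Y}_{(1)} \mid \mathbf{Y}_{(2)} \mid \cdots \mid \mathbf{Y}_{(m)}]$$

or

$$\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y} \tag{7-26}$$

For any choice of parameters $\mathbf{B} = [\mathbf{b}_{(1)} \mid \mathbf{b}_{(2)} \mid \cdots \mid \mathbf{b}_{(m)}]$, the matrix of errors
is $\mathbf{Y} - \mathbf{Z}\mathbf{B}$. The error sum of squares and cross products matrix is

$$(\mathbf{Y} - \mathbf{Z}\mathbf{B})'(\mathbf{Y} - \mathbf{Z}\mathbf{B}) = \begin{bmatrix} (\mathbf{Y}_{(1)} - \mathbf{Z}\mathbf{b}_{(1)})'(\mathbf{Y}_{(1)} - \mathbf{Z}\mathbf{b}_{(1)}) & \cdots & (\mathbf{Y}_{(1)} - \mathbf{Z}\mathbf{b}_{(1)})'(\mathbf{Y}_{(m)} - \mathbf{Z}\mathbf{b}_{(m)}) \\ \vdots & & \vdots \\ (\mathbf{Y}_{(m)} - \mathbf{Z}\mathbf{b}_{(m)})'(\mathbf{Y}_{(1)} - \mathbf{Z}\mathbf{b}_{(1)}) & \cdots & (\mathbf{Y}_{(m)} - \mathbf{Z}\mathbf{b}_{(m)})'(\mathbf{Y}_{(m)} - \mathbf{Z}\mathbf{b}_{(m)}) \end{bmatrix} \tag{7-27}$$

The selection $\mathbf{b}_{(i)} = \hat{\boldsymbol{\beta}}_{(i)}$ minimizes the $i$th diagonal sum of squares
$(\mathbf{Y}_{(i)} - \mathbf{Z}\mathbf{b}_{(i)})'(\mathbf{Y}_{(i)} - \mathbf{Z}\mathbf{b}_{(i)})$. Consequently, $\text{tr}[(\mathbf{Y} - \mathbf{Z}\mathbf{B})'(\mathbf{Y} - \mathbf{Z}\mathbf{B})]$ is minimized
by the choice $\mathbf{B} = \hat{\boldsymbol{\beta}}$. Also, the generalized variance $|(\mathbf{Y} - \mathbf{Z}\mathbf{B})'(\mathbf{Y} - \mathbf{Z}\mathbf{B})|$ is
minimized by the least squares estimates $\hat{\boldsymbol{\beta}}$. (See Exercise 7.11 for an additional
generalized sum of squares property.)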
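Using the least squares estimates $\hat{\boldsymbol{\beta}}$, we can form the matrices of

$$\begin{aligned}
\text{Predicted values:}\quad \hat{\mathbf{Y}} &= \mathbf{Z}\hat{\boldsymbol{\beta}} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y} \\
\text{Residuals:}\quad \hat{\boldsymbol{\varepsilon}} &= \mathbf{Y} - \hat{\mathbf{Y}} = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{Y}
\end{aligned} \tag{7-28}$$

The orthogonality conditions among the residuals, predicted values, and columns of $\mathbf{Z}$,
which hold in classical linear regression, hold in multivariate multiple regression.
They follow from $\mathbf{Z}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'] = \mathbf{Z}' - \mathbf{Z}' = \mathbf{0}$. Specifically,

$$\mathbf{Z}'\hat{\boldsymbol{\varepsilon}} = \mathbf{Z}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{Y} = \mathbf{0} \tag{7-29}$$

so the residuals $\hat{\boldsymbol{\varepsilon}}_{(i)}$ are perpendicular to the columns of $\mathbf{Z}$. Also,

$$\hat{\mathbf{Y}}'\hat{\boldsymbol{\varepsilon}} = \hat{\boldsymbol{\beta}}'\mathbf{Z}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{Y} = \mathbf{0} \tag{7-30}$$

confirming that the predicted values $\hat{\mathbf{Y}}_{(i)}$ are perpendicular to all residual vectors
$\hat{\boldsymbol{\varepsilon}}_{(k)}$. Because $\mathbf{Y} = \hat{\mathbf{Y}} + \hat{\boldsymbol{\varepsilon}}$,

$$\mathbf{Y}'\mathbf{Y} = (\hat{\mathbf{Y}} + \hat{\boldsymbol{\varepsilon}})'(\hat{\mathbf{Y}} + \hat{\boldsymbol{\varepsilon}}) = \hat{\mathbf{Y}}'\hat{\mathbf{Y}} + \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} + \mathbf{0} + \mathbf{0}'$$

or

$$\underset{\substack{\text{total sum of squares} \\ \text{and cross products}}}{\mathbf{Y}'\mathbf{Y}} = \underset{\substack{\text{predicted sum of squares} \\ \text{and cross products}}}{\hat{\mathbf{Y}}'\hat{\mathbf{Y}}} + \underset{\substack{\text{residual (error) sum of} \\ \text{squares and cross products}}}{\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}} \tag{7-31}$$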



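The residual sum of squares and cross products can also be written as

$$\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = \mathbf{Y}'\mathbf{Y} - \hat{\mathbf{Y}}'\hat{\mathbf{Y}} = \mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol{\beta}}'\mathbf{Z}'\mathbf{Z}\hat{\boldsymbol{\beta}} \tag{7-32}$$

Example 7.8 (Fitting a multivariate straight-line regression model) To illustrate the
calculations of $\hat{\boldsymbol{\beta}}$, $\hat{\mathbf{Y}}$, and $\hat{\boldsymbol{\varepsilon}}$, we fit a straight-line regression model (see Panel 7.2),

$$\begin{aligned}
Y_{j1} &= \beta_{01} + \beta_{11}z_{j1} + \varepsilon_{j1} \\
Y_{j2} &= \beta_{02} + \beta_{12}z_{j1} + \varepsilon_{j2}, \qquad j = 1, 2, \ldots, 5
\end{aligned}$$

to two responses $Y_1$ and $Y_2$ using the data in Example 7.3. These data, augmented by
observations on an additional response, are as follows:

z1     0    1    2    3    4
y1     1    4    3    8    9
y2    -1   -1    2    3    2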

PANEL 7.2 SAS ANALYSIS FOR EXAMPLE 7.8 USING PROC GLM

PROGRAM COMMANDS

    title 'Multivariate Regression Analysis';
    data mra;
    infile 'Example 7-8 data';
    input y1 y2 z1;
    proc glm data = mra;
    model y1 y2 = z1/ss3;
    manova h = z1/printe;

General Linear Models Procedure

Dependent Variable: Y1

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1    40.00000000       40.00000000    20.00      0.0208
Error               3     6.00000000        2.00000000
Corrected Total     4    46.00000000

R-Square    C.V.        Root MSE    Y1 Mean
0.869565    28.28427    1.414214    5.00000000

Source    DF    Type III SS    Mean Square    F Value    Pr > F
Z1         1    40.00000000    40.00000000    20.00      0.0208

Parameter    Estimate       T for H0: Parameter = 0    Pr > |T|    Std Error of Estimate
INTERCEPT    1.000000000    0.91                       0.4286      1.09544512
Z1           2.000000000    4.47                       0.0208      0.44721360

Dependent Variable: Y2

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1    10.00000000       10.00000000    7.50       0.0714
Error               3     4.00000000        1.33333333
Corrected Total     4    14.00000000

R-Square    C.V.        Root MSE    Y2 Mean
0.714286    115.4701    1.154701    1.00000000

Source    DF    Type III SS    Mean Square    F Value    Pr > F
Z1         1    10.00000000    10.00000000    7.50       0.0714

Parameter    Estimate        T for H0: Parameter = 0    Pr > |T|    Std Error of Estimate
INTERCEPT    -1.000000000    -1.12                      0.3450      0.89442719
Z1            1.000000000     2.74                      0.0714      0.36514837

E = Error SS & CP Matrix

        Y1    Y2
Y1       6    -2
Y2      -2     4

Manova Test Criteria and Exact F Statistics for
the Hypothesis of no Overall Z1 Effect

H = Type III SS&CP Matrix for Z1    E = Error SS&CP Matrix

S = 1    M = 0    N = 0

Statistic                 Value          F          Num DF    Den DF    Pr > F
Wilks' Lambda              0.06250000    15.0000    2         2         0.0625
Pillai's Trace             0.93750000    15.0000    2         2         0.0625
Hotelling-Lawley Trace    15.00000000    15.0000    2         2         0.0625
Roy's Greatest Root       15.00000000    15.0000    2         2         0.0625



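and

$$\mathbf{Z}'\mathbf{y}_{(2)} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} -1 \\ -1 \\ 2 \\ 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 5 \\ 20 \end{bmatrix}$$

so

$$\hat{\boldsymbol{\beta}}_{(2)} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}_{(2)} = \begin{bmatrix} .6 & -.2 \\ -.2 & .1 \end{bmatrix} \begin{bmatrix} 5 \\ 20 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$

From Example 7.3,

$$\hat{\boldsymbol{\beta}}_{(1)} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}_{(1)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$

Hence,

$$\hat{\boldsymbol{\beta}} = [\hat{\boldsymbol{\beta}}_{(1)} \mid \hat{\boldsymbol{\beta}}_{(2)}] = \begin{bmatrix} 1 & -1 \\ 2 & 1 \end{bmatrix} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'[\mathbf{y}_{(1)} \mid \mathbf{y}_{(2)}]$$

The fitted values are generated from $\hat{y}_1 = 1 + 2z_1$ and $\hat{y}_2 = -1 + z_1$. Collectively,

$$\hat{\mathbf{Y}} = \mathbf{Z}\hat{\boldsymbol{\beta}} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} 1 & -1 \\ 2 & 1 \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ 3 & 0 \\ 5 & 1 \\ 7 & 2 \\ 9 & 3 \end{bmatrix}$$

and

$$\hat{\boldsymbol{\varepsilon}} = \mathbf{Y} - \hat{\mathbf{Y}} = \begin{bmatrix} 0 & 1 & -2 & 1 & 0 \\ 0 & -1 & 1 & 1 & -1 \end{bmatrix}'$$

Note that

$$\hat{\boldsymbol{\varepsilon}}'\hat{\mathbf{Y}} = \begin{bmatrix} 0 & 1 & -2 & 1 & 0 \\ 0 & -1 & 1 & 1 & -1 \end{bmatrix} \begin{bmatrix} 1 & -1 \\ 3 & 0 \\ 5 & 1 \\ 7 & 2 \\ 9 & 3 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$$

Since

$$\mathbf{Y}'\mathbf{Y} = \begin{bmatrix} 1 & 4 & 3 & 8 & 9 \\ -1 & -1 & 2 & 3 & 2 \end{bmatrix} \begin{bmatrix} 1 & -1 \\ 4 & -1 \\ 3 & 2 \\ 8 & 3 \\ 9 & 2 \end{bmatrix} = \begin{bmatrix} 171 & 43 \\ 43 & 19 \end{bmatrix}$$

$$\hat{\mathbf{Y}}'\hat{\mathbf{Y}} = \begin{bmatrix} 165 & 45 \\ 45 & 15 \end{bmatrix} \quad\text{and}\quad \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = \begin{bmatrix} 6 & -2 \\ -2 & 4 \end{bmatrix}$$

the sum of squares and cross-products decomposition

$$\mathbf{Y}'\mathbf{Y} = \hat{\mathbf{Y}}'\hat{\mathbf{Y}} + \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$$

is easily verified. ■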
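The hand calculations of Example 7.8 can be reproduced in a few lines. This is a sketch assuming NumPy; the variable names are illustrative only, and the data are those of the example.

```python
import numpy as np

# Data of Example 7.8: one predictor, two responses, five trials.
z1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([[1.0, -1.0], [4.0, -1.0], [3.0, 2.0], [8.0, 3.0], [9.0, 2.0]])
Z = np.column_stack([np.ones(5), z1])

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)   # (Z'Z)^{-1} Z'Y, one column per response
Y_hat = Z @ beta_hat                           # predicted values (7-28)
resid = Y - Y_hat                              # residuals (7-28)

print(beta_hat)                                # columns are beta_hat_(1), beta_hat_(2)
print(resid.T @ Y_hat)                         # orthogonality (7-30): numerically zero
print(Y.T @ Y)                                 # total sum of squares and cross products
print(Y_hat.T @ Y_hat + resid.T @ resid)       # decomposition (7-31)
```

Solving the normal equations once with the whole response matrix on the right-hand side gives the same columns as fitting each response separately, which is the content of (7-26).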


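Result 7.9. For the least squares estimator $\hat{\boldsymbol{\beta}} = [\hat{\boldsymbol{\beta}}_{(1)} \mid \hat{\boldsymbol{\beta}}_{(2)} \mid \cdots \mid \hat{\boldsymbol{\beta}}_{(m)}]$ determined
under the multivariate multiple regression model (7-23) with full
rank$(\mathbf{Z}) = r + 1 < n$,

$$E(\hat{\boldsymbol{\beta}}_{(i)}) = \boldsymbol{\beta}_{(i)} \quad\text{or}\quad E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$$

and

$$\text{Cov}(\hat{\boldsymbol{\beta}}_{(i)}, \hat{\boldsymbol{\beta}}_{(k)}) = \sigma_{ik}(\mathbf{Z}'\mathbf{Z})^{-1}, \qquad i, k = 1, 2, \ldots, m$$

The residuals $\hat{\boldsymbol{\varepsilon}} = [\hat{\boldsymbol{\varepsilon}}_{(1)} \mid \hat{\boldsymbol{\varepsilon}}_{(2)} \mid \cdots \mid \hat{\boldsymbol{\varepsilon}}_{(m)}] = \mathbf{Y} - \mathbf{Z}\hat{\boldsymbol{\beta}}$ satisfy $E(\hat{\boldsymbol{\varepsilon}}_{(i)}) = \mathbf{0}$ and
$E(\hat{\boldsymbol{\varepsilon}}_{(i)}'\hat{\boldsymbol{\varepsilon}}_{(k)}) = (n - r - 1)\sigma_{ik}$, so

$$E(\hat{\boldsymbol{\varepsilon}}) = \mathbf{0} \quad\text{and}\quad E\left(\frac{1}{n - r - 1}\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}\right) = \boldsymbol{\Sigma}$$

Also, $\hat{\boldsymbol{\varepsilon}}$ and $\hat{\boldsymbol{\beta}}$ are uncorrelated.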


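Proof. The $i$th response follows the multiple regression model

$$\mathbf{Y}_{(i)} = \mathbf{Z}\boldsymbol{\beta}_{(i)} + \boldsymbol{\varepsilon}_{(i)}, \quad E(\boldsymbol{\varepsilon}_{(i)}) = \mathbf{0}, \quad\text{and}\quad E(\boldsymbol{\varepsilon}_{(i)}\boldsymbol{\varepsilon}_{(k)}') = \sigma_{ik}\mathbf{I}$$

Also, from (7-24),

$$\hat{\boldsymbol{\beta}}_{(i)} - \boldsymbol{\beta}_{(i)} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y}_{(i)} - \boldsymbol{\beta}_{(i)} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\boldsymbol{\varepsilon}_{(i)} \tag{7-33}$$

and

$$\hat{\boldsymbol{\varepsilon}}_{(i)} = \mathbf{Y}_{(i)} - \hat{\mathbf{Y}}_{(i)} = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{Y}_{(i)} = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\boldsymbol{\varepsilon}_{(i)}$$

so $E(\hat{\boldsymbol{\beta}}_{(i)}) = \boldsymbol{\beta}_{(i)}$ and $E(\hat{\boldsymbol{\varepsilon}}_{(i)}) = \mathbf{0}$.

Next,

$$\text{Cov}(\hat{\boldsymbol{\beta}}_{(i)}, \hat{\boldsymbol{\beta}}_{(k)}) = E(\hat{\boldsymbol{\beta}}_{(i)} - \boldsymbol{\beta}_{(i)})(\hat{\boldsymbol{\beta}}_{(k)} - \boldsymbol{\beta}_{(k)})' = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'E(\boldsymbol{\varepsilon}_{(i)}\boldsymbol{\varepsilon}_{(k)}')\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1} = \sigma_{ik}(\mathbf{Z}'\mathbf{Z})^{-1}$$

Using Result 4.9, with $\mathbf{U}$ any random vector and $\mathbf{A}$ a fixed matrix, we have
that $E[\mathbf{U}'\mathbf{A}\mathbf{U}] = E[\text{tr}(\mathbf{A}\mathbf{U}\mathbf{U}')] = \text{tr}[\mathbf{A}E(\mathbf{U}\mathbf{U}')]$. Consequently, from the proof
of Result 7.1 and using Result 2A.12

$$\begin{aligned}
E(\hat{\boldsymbol{\varepsilon}}_{(i)}'\hat{\boldsymbol{\varepsilon}}_{(k)}) &= E(\boldsymbol{\varepsilon}_{(i)}'(\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}')\boldsymbol{\varepsilon}_{(k)}) = \text{tr}[(\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}')\sigma_{ik}\mathbf{I}] \\
&= \sigma_{ik}\,\text{tr}[(\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}')] = \sigma_{ik}(n - r - 1)
\end{aligned}$$

Dividing each entry $\hat{\boldsymbol{\varepsilon}}_{(i)}'\hat{\boldsymbol{\varepsilon}}_{(k)}$ of $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ by $n - r - 1$, we obtain the unbiased estimator
of $\boldsymbol{\Sigma}$. Finally,

$$\begin{aligned}
\text{Cov}(\hat{\boldsymbol{\beta}}_{(i)}, \hat{\boldsymbol{\varepsilon}}_{(k)}) &= E[(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\boldsymbol{\varepsilon}_{(i)}\boldsymbol{\varepsilon}_{(k)}'(\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}')] \\
&= (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'E(\boldsymbol{\varepsilon}_{(i)}\boldsymbol{\varepsilon}_{(k)}')(\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}') \\
&= (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\sigma_{ik}\mathbf{I}(\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}') \\
&= \sigma_{ik}((\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}' - (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}') = \mathbf{0}
\end{aligned}$$

so each element of $\hat{\boldsymbol{\beta}}$ is uncorrelated with each element of $\hat{\boldsymbol{\varepsilon}}$. ■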


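The mean vectors and covariance matrices determined in Result 7.9 enable us
to obtain the sampling properties of the least squares predictors.

We first consider the problem of estimating the mean vector when the predictor
variables have the values $\mathbf{z}_0' = [1, z_{01}, \ldots, z_{0r}]$. The mean of the $i$th response
variable is $\mathbf{z}_0'\boldsymbol{\beta}_{(i)}$, and this is estimated by $\mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(i)}$, the $i$th component of the fitted
regression relationship. Collectively,

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} = [\mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(1)} \mid \mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(2)} \mid \cdots \mid \mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(m)}] \tag{7-34}$$

is an unbiased estimator of $\mathbf{z}_0'\boldsymbol{\beta}$ since $E(\mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(i)}) = \mathbf{z}_0'E(\hat{\boldsymbol{\beta}}_{(i)}) = \mathbf{z}_0'\boldsymbol{\beta}_{(i)}$ for each component.
From the covariance matrix for $\hat{\boldsymbol{\beta}}_{(i)}$ and $\hat{\boldsymbol{\beta}}_{(k)}$, the estimation errors $\mathbf{z}_0'\boldsymbol{\beta}_{(i)} - \mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(i)}$
have covariances

$$E[\mathbf{z}_0'(\boldsymbol{\beta}_{(i)} - \hat{\boldsymbol{\beta}}_{(i)})(\boldsymbol{\beta}_{(k)} - \hat{\boldsymbol{\beta}}_{(k)})'\mathbf{z}_0] = \mathbf{z}_0'(E(\boldsymbol{\beta}_{(i)} - \hat{\boldsymbol{\beta}}_{(i)})(\boldsymbol{\beta}_{(k)} - \hat{\boldsymbol{\beta}}_{(k)})')\mathbf{z}_0 = \sigma_{ik}\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0 \tag{7-35}$$

The related problem is that of forecasting a new observation vector $\mathbf{Y}_0' = [Y_{01}, Y_{02}, \ldots, Y_{0m}]$ at $\mathbf{z}_0$. According to the regression model, $Y_{0i} = \mathbf{z}_0'\boldsymbol{\beta}_{(i)} + \varepsilon_{0i}$ where
the “new” error $\boldsymbol{\varepsilon}_0' = [\varepsilon_{01}, \varepsilon_{02}, \ldots, \varepsilon_{0m}]$ is independent of the errors $\boldsymbol{\varepsilon}$ and satisfies
$E(\varepsilon_{0i}) = 0$ and $E(\varepsilon_{0i}\varepsilon_{0k}) = \sigma_{ik}$. The forecast error for the $i$th component of $\mathbf{Y}_0$ is

$$Y_{0i} - \mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(i)} = Y_{0i} - \mathbf{z}_0'\boldsymbol{\beta}_{(i)} + \mathbf{z}_0'\boldsymbol{\beta}_{(i)} - \mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(i)} = \varepsilon_{0i} - \mathbf{z}_0'(\hat{\boldsymbol{\beta}}_{(i)} - \boldsymbol{\beta}_{(i)})$$

so $E(Y_{0i} - \mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(i)}) = E(\varepsilon_{0i}) - \mathbf{z}_0'E(\hat{\boldsymbol{\beta}}_{(i)} - \boldsymbol{\beta}_{(i)}) = 0$, indicating that $\mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(i)}$ is an
unbiased predictor of $Y_{0i}$. The forecast errors have covariances

$$\begin{aligned}
E(Y_{0i} - \mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(i)})(Y_{0k} - \mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(k)})
&= E(\varepsilon_{0i} - \mathbf{z}_0'(\hat{\boldsymbol{\beta}}_{(i)} - \boldsymbol{\beta}_{(i)}))(\varepsilon_{0k} - \mathbf{z}_0'(\hat{\boldsymbol{\beta}}_{(k)} - \boldsymbol{\beta}_{(k)})) \\
&= E(\varepsilon_{0i}\varepsilon_{0k}) + \mathbf{z}_0'E(\hat{\boldsymbol{\beta}}_{(i)} - \boldsymbol{\beta}_{(i)})(\hat{\boldsymbol{\beta}}_{(k)} - \boldsymbol{\beta}_{(k)})'\mathbf{z}_0 \\
&\quad - \mathbf{z}_0'E((\hat{\boldsymbol{\beta}}_{(i)} - \boldsymbol{\beta}_{(i)})\varepsilon_{0k}) - E(\varepsilon_{0i}(\hat{\boldsymbol{\beta}}_{(k)} - \boldsymbol{\beta}_{(k)})')\mathbf{z}_0 \\
&= \sigma_{ik}(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)
\end{aligned} \tag{7-36}$$

Note that $E((\hat{\boldsymbol{\beta}}_{(i)} - \boldsymbol{\beta}_{(i)})\varepsilon_{0k}) = \mathbf{0}$ since $\hat{\boldsymbol{\beta}}_{(i)} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\boldsymbol{\varepsilon}_{(i)} + \boldsymbol{\beta}_{(i)}$ is independent
of $\boldsymbol{\varepsilon}_0$. A similar result holds for $E(\varepsilon_{0i}(\hat{\boldsymbol{\beta}}_{(k)} - \boldsymbol{\beta}_{(k)})')$.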

Maximum likelihood estimators and their distributions can be obtained when
the errors $\boldsymbol{\varepsilon}$ have a normal distribution.




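Result 7.10. Let the multivariate multiple regression model in (7-23) hold with full
rank$(\mathbf{Z}) = r + 1$, $n \ge (r + 1) + m$, and let the errors $\boldsymbol{\varepsilon}$ have a normal distribution. Then

$$\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y}$$

is the maximum likelihood estimator of $\boldsymbol{\beta}$ and $\hat{\boldsymbol{\beta}}$ has a normal distribution with
$E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$ and $\text{Cov}(\hat{\boldsymbol{\beta}}_{(i)}, \hat{\boldsymbol{\beta}}_{(k)}) = \sigma_{ik}(\mathbf{Z}'\mathbf{Z})^{-1}$. Also, $\hat{\boldsymbol{\beta}}$ is independent of the
maximum likelihood estimator of the positive definite $\boldsymbol{\Sigma}$ given by

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = \frac{1}{n}(\mathbf{Y} - \mathbf{Z}\hat{\boldsymbol{\beta}})'(\mathbf{Y} - \mathbf{Z}\hat{\boldsymbol{\beta}})$$

and

$$n\hat{\boldsymbol{\Sigma}} \text{ is distributed as } W_{p,\,n-r-1}(\boldsymbol{\Sigma})$$

The maximized likelihood is $L(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\Sigma}}) = (2\pi)^{-mn/2}|\hat{\boldsymbol{\Sigma}}|^{-n/2}e^{-mn/2}$.

Proof. (See website: www.prenhall.com/statistics) ■

Result 7.10 provides additional support for using least squares estimates.
When the errors are normally distributed, $\hat{\boldsymbol{\beta}}$ and $n^{-1}\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$ are the maximum
likelihood estimators of $\boldsymbol{\beta}$ and $\boldsymbol{\Sigma}$, respectively. Therefore, for large samples, they have
nearly the smallest possible variances.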


Comment. The multivariate multiple regression model poses no new computational
problems. Least squares (maximum likelihood) estimates, $\hat{\boldsymbol{\beta}}_{(i)} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}_{(i)}$,
are computed individually for each response variable. Note, however, that the model
requires that the same predictor variables be used for all responses.


Once a multivariate multiple regression model has been fit to the data, it should 
be subjected to the diagnostic checks described in Section 7.6 for the single-response 
model. The residual vectors $[\hat{\varepsilon}_{j1}, \hat{\varepsilon}_{j2}, \ldots, \hat{\varepsilon}_{jm}]$ can be examined for normality or
outliers using the techniques in Section 4.6. 

The remainder of this section is devoted to brief discussions of inference for the 
normal theory multivariate multiple regression model. Extended accounts of these 
procedures appear in [2] and [18]. 


Likelihood Ratio Tests for Regression Parameters 


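The multiresponse analog of (7-12), the hypothesis that the responses do not depend
on $z_{q+1}, z_{q+2}, \ldots, z_r$, becomes

$$H_0\colon \boldsymbol{\beta}_{(2)} = \mathbf{0} \quad\text{where}\quad \boldsymbol{\beta} = \begin{bmatrix} \underset{((q+1)\times m)}{\boldsymbol{\beta}_{(1)}} \\ \underset{((r-q)\times m)}{\boldsymbol{\beta}_{(2)}} \end{bmatrix} \tag{7-37}$$

Setting $\mathbf{Z} = \Big[\underset{(n\times(q+1))}{\mathbf{Z}_1} \;\Big|\; \underset{(n\times(r-q))}{\mathbf{Z}_2}\Big]$, we can write the general model as

$$E(\mathbf{Y}) = \mathbf{Z}\boldsymbol{\beta} = [\mathbf{Z}_1 \mid \mathbf{Z}_2]\begin{bmatrix} \boldsymbol{\beta}_{(1)} \\ \boldsymbol{\beta}_{(2)} \end{bmatrix} = \mathbf{Z}_1\boldsymbol{\beta}_{(1)} + \mathbf{Z}_2\boldsymbol{\beta}_{(2)}$$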




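Under $H_0\colon \boldsymbol{\beta}_{(2)} = \mathbf{0}$, $\mathbf{Y} = \mathbf{Z}_1\boldsymbol{\beta}_{(1)} + \boldsymbol{\varepsilon}$ and the likelihood ratio test of $H_0$ is based
on the quantities involved in the

$$\begin{aligned}
\text{extra sum of squares and cross products} &= (\mathbf{Y} - \mathbf{Z}_1\hat{\boldsymbol{\beta}}_{(1)})'(\mathbf{Y} - \mathbf{Z}_1\hat{\boldsymbol{\beta}}_{(1)}) - (\mathbf{Y} - \mathbf{Z}\hat{\boldsymbol{\beta}})'(\mathbf{Y} - \mathbf{Z}\hat{\boldsymbol{\beta}}) \\
&= n(\hat{\boldsymbol{\Sigma}}_1 - \hat{\boldsymbol{\Sigma}})
\end{aligned}$$

where $\hat{\boldsymbol{\beta}}_{(1)} = (\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{Y}$ and $\hat{\boldsymbol{\Sigma}}_1 = n^{-1}(\mathbf{Y} - \mathbf{Z}_1\hat{\boldsymbol{\beta}}_{(1)})'(\mathbf{Y} - \mathbf{Z}_1\hat{\boldsymbol{\beta}}_{(1)})$.

From Result 7.10, the likelihood ratio, $\Lambda$, can be expressed in terms of generalized
variances:

$$\Lambda = \frac{\max_{\boldsymbol{\beta}_{(1)},\boldsymbol{\Sigma}} L(\boldsymbol{\beta}_{(1)}, \boldsymbol{\Sigma})}{\max_{\boldsymbol{\beta},\boldsymbol{\Sigma}} L(\boldsymbol{\beta}, \boldsymbol{\Sigma})} = \frac{L(\hat{\boldsymbol{\beta}}_{(1)}, \hat{\boldsymbol{\Sigma}}_1)}{L(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\Sigma}})} = \left(\frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_1|}\right)^{n/2} \tag{7-38}$$

Equivalently, Wilks' lambda statistic

$$\Lambda^{2/n} = \frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_1|}$$

can be used.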


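Result 7.11. Let the multivariate multiple regression model of (7-23) hold with $\mathbf{Z}$
of full rank $r + 1$ and $(r + 1) + m \le n$. Let the errors $\boldsymbol{\varepsilon}$ be normally distributed.
Under $H_0\colon \boldsymbol{\beta}_{(2)} = \mathbf{0}$, $n\hat{\boldsymbol{\Sigma}}$ is distributed as $W_{p,\,n-r-1}(\boldsymbol{\Sigma})$ independently of $n(\hat{\boldsymbol{\Sigma}}_1 - \hat{\boldsymbol{\Sigma}})$
which, in turn, is distributed as $W_{p,\,r-q}(\boldsymbol{\Sigma})$. The likelihood ratio test of $H_0$ is equivalent
to rejecting $H_0$ for large values of

$$-2\ln\Lambda = -n\ln\left(\frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_1|}\right) = -n\ln\frac{|n\hat{\boldsymbol{\Sigma}}|}{|n\hat{\boldsymbol{\Sigma}} + n(\hat{\boldsymbol{\Sigma}}_1 - \hat{\boldsymbol{\Sigma}})|}$$

For $n$ large,⁵ the modified statistic

$$-\left[n - r - 1 - \tfrac{1}{2}(m - r + q + 1)\right]\ln\left(\frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_1|}\right)$$

has, to a close approximation, a chi-square distribution with $m(r - q)$ d.f.

Proof. (See Supplement 7A.) ■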


If $\mathbf{Z}$ is not of full rank, but has rank $r_1 + 1$, then $\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}'\mathbf{Y}$, where
$(\mathbf{Z}'\mathbf{Z})^{-}$ is the generalized inverse discussed in [22]. (See also Exercise 7.6.) The
distributional conclusions stated in Result 7.11 remain the same, provided that $r$ is
replaced by $r_1$ and $q + 1$ by rank$(\mathbf{Z}_1)$. However, not all hypotheses concerning $\boldsymbol{\beta}$
can be tested due to the lack of uniqueness in the identification of $\boldsymbol{\beta}$ caused by the
linear dependencies among the columns of $\mathbf{Z}$. Nevertheless, the generalized inverse
allows all of the important MANOVA models to be analyzed as special cases of the
multivariate multiple regression model.


⁵Technically, both $n - r$ and $n - m$ should also be large to obtain a good chi-square approximation.




Example 7.9 (Testing the importance of additional predictors with a multivariate
response) The service in three locations of a large restaurant chain was rated
according to two measures of quality by male and female patrons. The first service-quality
index was introduced in Example 7.5. Suppose we consider a regression model
that allows for the effects of location, gender, and the location-gender interaction on
both service-quality indices. The design matrix (see Example 7.5) remains the same
for the two-response situation. We shall illustrate the test of no location-gender
interaction in either response using Result 7.11. A computer program provides

$$\begin{pmatrix}\text{residual sum of squares} \\ \text{and cross products}\end{pmatrix} = n\hat{\boldsymbol{\Sigma}} = \begin{bmatrix} 2977.39 & 1021.72 \\ 1021.72 & 2050.95 \end{bmatrix}$$

$$\begin{pmatrix}\text{extra sum of squares} \\ \text{and cross products}\end{pmatrix} = n(\hat{\boldsymbol{\Sigma}}_1 - \hat{\boldsymbol{\Sigma}}) = \begin{bmatrix} 441.76 & 246.16 \\ 246.16 & 366.12 \end{bmatrix}$$

Let $\boldsymbol{\beta}_{(2)}$ be the matrix of interaction parameters for the two responses. Although
the sample size $n = 18$ is not large, we shall illustrate the calculations involved in
the test of $H_0\colon \boldsymbol{\beta}_{(2)} = \mathbf{0}$ given in Result 7.11. Setting $\alpha = .05$, we test $H_0$ by referring

$$-\left[n - r_1 - 1 - \tfrac{1}{2}(m - r_1 + q_1 + 1)\right]\ln\left(\frac{|n\hat{\boldsymbol{\Sigma}}|}{|n\hat{\boldsymbol{\Sigma}} + n(\hat{\boldsymbol{\Sigma}}_1 - \hat{\boldsymbol{\Sigma}})|}\right) = -\left[18 - 5 - 1 - \tfrac{1}{2}(2 - 5 + 3 + 1)\right]\ln(.7605) = 3.28$$

to a chi-square percentage point with $m(r_1 - q_1) = 2(2) = 4$ d.f. Since $3.28 < \chi_4^2(.05) = 9.49$,
we do not reject $H_0$ at the 5% level. The interaction terms are not needed. ■


Information criteria are also available to aid in the selection of a simple but
adequate multivariate multiple regression model. For a model that includes $d$
predictor variables counting the intercept, let

$$\hat{\boldsymbol{\Sigma}}_d = \frac{1}{n}(\text{residual sum of squares and cross products matrix})$$

Then, the multivariate multiple regression version of Akaike's information
criterion is

$$\text{AIC} = n\ln(|\hat{\boldsymbol{\Sigma}}_d|) - 2p \times d$$

This criterion attempts to balance the generalized variance with the number of
parameters. Models with smaller AIC's are preferable.
In the context of Example 7.9, under the null hypothesis of no interaction terms,
we have $n = 18$, $p = 2$ response variables, and $d = 4$ terms, so

$$\text{AIC} = n\ln(|\hat{\boldsymbol{\Sigma}}|) - 2p \times d = 18\ln\left(\left|\frac{1}{18}\begin{bmatrix} 3419.15 & 1267.88 \\ 1267.88 & 2417.07 \end{bmatrix}\right|\right) - 2 \times 2 \times 4 = 18 \times \ln(20545.7) - 16 = 162.75$$
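The determinant ratio and the AIC value quoted for Example 7.9 can be recomputed from the two matrices reported there. This is a sketch assuming NumPy; `mv_aic` is a hypothetical helper name, and it follows the sign convention of the formula as printed in the text.

```python
import numpy as np

n = 18
E = np.array([[2977.39, 1021.72], [1021.72, 2050.95]])   # n * Sigma_hat (full model)
H = np.array([[441.76, 246.16], [246.16, 366.12]])       # n * (Sigma_hat_1 - Sigma_hat)

# Wilks' lambda ratio |Sigma_hat| / |Sigma_hat_1| = |E| / |E + H|
wilks = np.linalg.det(E) / np.linalg.det(E + H)

def mv_aic(resid_sscp, n, p, d):
    """n ln|Sigma_hat_d| - 2 p d, with Sigma_hat_d = (residual SS&CP matrix) / n."""
    return n * np.log(np.linalg.det(resid_sscp / n)) - 2 * p * d

# Under "no interaction" the residual SS&CP matrix is E + H, with d = 4 terms.
aic_no_interaction = mv_aic(E + H, n, p=2, d=4)
print(round(wilks, 4), round(aic_no_interaction, 2))
```

The two printed values agree with the .7605 and 162.75 used in the example. Working with $|\mathbf{E}|$ and $|\mathbf{E} + \mathbf{H}|$ avoids forming $\hat{\boldsymbol{\Sigma}}_1$ explicitly, because the factors of $n$ cancel in the ratio.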


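More generally, we could consider a null hypothesis of the form $H_0\colon \mathbf{C}\boldsymbol{\beta} = \boldsymbol{\Gamma}_0$,
where $\mathbf{C}$ is $(r - q) \times (r + 1)$ and is of full rank $(r - q)$. For the choices
$\mathbf{C} = [\mathbf{0} \mid \mathbf{I}]$, with $\mathbf{I}$ the $(r - q) \times (r - q)$ identity matrix, and $\boldsymbol{\Gamma}_0 = \mathbf{0}$, this null hypothesis becomes $H_0\colon \mathbf{C}\boldsymbol{\beta} = \boldsymbol{\beta}_{(2)} = \mathbf{0}$,
the case considered earlier. It can be shown that the extra sum of squares and cross
products generated by the hypothesis $H_0$ is

$$n(\hat{\boldsymbol{\Sigma}}_1 - \hat{\boldsymbol{\Sigma}}) = (\mathbf{C}\hat{\boldsymbol{\beta}} - \boldsymbol{\Gamma}_0)'(\mathbf{C}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{C}')^{-1}(\mathbf{C}\hat{\boldsymbol{\beta}} - \boldsymbol{\Gamma}_0)$$

Under the null hypothesis, the statistic $n(\hat{\boldsymbol{\Sigma}}_1 - \hat{\boldsymbol{\Sigma}})$ is distributed as $W_{r-q}(\boldsymbol{\Sigma})$
independently of $\hat{\boldsymbol{\Sigma}}$. This distribution theory can be employed to develop a test of
$H_0\colon \mathbf{C}\boldsymbol{\beta} = \boldsymbol{\Gamma}_0$ similar to the test discussed in Result 7.11. (See, for example, [18].)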


Other Multivariate Test Statistics 


Tests other than the likelihood ratio test have been proposed for testing $H_0\colon \boldsymbol{\beta}_{(2)} = \mathbf{0}$
in the multivariate multiple regression model.

Popular computer-package programs routinely calculate four multivariate test
statistics. To connect with their output, we introduce some alternative notation. Let
$\mathbf{E}$ be the $p \times p$ error, or residual, sum of squares and cross products matrix

$$\mathbf{E} = n\hat{\boldsymbol{\Sigma}}$$

that results from fitting the full model. The $p \times p$ hypothesis, or extra, sum of
squares and cross-products matrix is

$$\mathbf{H} = n(\hat{\boldsymbol{\Sigma}}_1 - \hat{\boldsymbol{\Sigma}})$$

The statistics can be defined in terms of $\mathbf{E}$ and $\mathbf{H}$ directly, or in terms of
the nonzero eigenvalues $\eta_1 \ge \eta_2 \ge \cdots \ge \eta_s$ of $\mathbf{H}\mathbf{E}^{-1}$, where $s = \min(p, r - q)$.
Equivalently, they are the roots of $|(\hat{\boldsymbol{\Sigma}}_1 - \hat{\boldsymbol{\Sigma}}) - \eta\hat{\boldsymbol{\Sigma}}| = 0$. The definitions are

$$\text{Wilks' lambda} = \prod_{i=1}^{s}\frac{1}{1 + \eta_i} = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}|}$$

$$\text{Pillai's trace} = \sum_{i=1}^{s}\frac{\eta_i}{1 + \eta_i} = \text{tr}[\mathbf{H}(\mathbf{H} + \mathbf{E})^{-1}]$$

$$\text{Hotelling-Lawley trace} = \sum_{i=1}^{s}\eta_i = \text{tr}[\mathbf{H}\mathbf{E}^{-1}]$$

$$\text{Roy's greatest root} = \frac{\eta_1}{1 + \eta_1}$$

Roy's test selects the coefficient vector $\mathbf{a}$ so that the univariate $F$-statistic based on
$\mathbf{a}'\mathbf{Y}_j$ has its maximum possible value. When several of the eigenvalues $\eta_i$ are moderately
large, Roy's test will perform poorly relative to the other three. Simulation
studies suggest that its power will be best when there is only one large eigenvalue.
Charts and tables of critical values are available for Roy's test. (See [21] and
[17].) Wilks' lambda, Roy's greatest root, and the Hotelling-Lawley trace test are
nearly equivalent for large sample sizes.
If there is a large discrepancy in the reported $P$-values for the four tests, the
eigenvalues and vectors may lead to an interpretation. In this text, we report Wilks'
lambda, which is the likelihood ratio test.
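The four statistics can be computed from the eigenvalues of $\mathbf{H}\mathbf{E}^{-1}$ as sketched below. The code assumes NumPy, uses the data of Example 7.8 with the hypothesis of no $z_1$ effect (so the reduced model contains only the intercept), and the helper names `manova_statistics` and `resid_sscp` are hypothetical.

```python
import numpy as np

def manova_statistics(H, E):
    """Four test statistics from the eigenvalues of H E^{-1}, as defined in the text."""
    eta = np.sort(np.linalg.eigvals(H @ np.linalg.inv(E)).real)[::-1]
    eta = np.clip(eta, 0.0, None)          # guard against tiny negative round-off
    return {
        "wilks": float(np.prod(1.0 / (1.0 + eta))),
        "pillai": float(np.sum(eta / (1.0 + eta))),
        "hotelling_lawley": float(np.sum(eta)),
        "roy": float(eta[0] / (1.0 + eta[0])),
    }

def resid_sscp(Z, Y):
    """Residual sum of squares and cross products matrix from regressing Y on Z."""
    B = np.linalg.solve(Z.T @ Z, Z.T @ Y)
    R = Y - Z @ B
    return R.T @ R

# Data of Example 7.8.
z1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([[1.0, -1.0], [4.0, -1.0], [3.0, 2.0], [8.0, 3.0], [9.0, 2.0]])
Z_full = np.column_stack([np.ones(5), z1])
Z_reduced = np.ones((5, 1))

E = resid_sscp(Z_full, Y)                  # n * Sigma_hat
H = resid_sscp(Z_reduced, Y) - E           # n * (Sigma_hat_1 - Sigma_hat)
stats = manova_statistics(H, E)
print(stats)
```

The Wilks, Pillai, and Hotelling-Lawley values match the MANOVA output in Panel 7.2. For Roy's statistic the panel lists the largest eigenvalue $\eta_1 = 15$ itself, whereas the function returns $\eta_1/(1 + \eta_1)$ in line with the definition given here.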




Predictions from Multivariate Multiple Regressions 


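Suppose the model $\mathbf{Y} = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, with normal errors $\boldsymbol{\varepsilon}$, has been fit and checked for
any inadequacies. If the model is adequate, it can be employed for predictive purposes.

One problem is to predict the mean responses corresponding to fixed values $\mathbf{z}_0$
of the predictor variables. Inferences about the mean responses can be made using
the distribution theory in Result 7.10. From this result, we determine that

$$\hat{\boldsymbol{\beta}}'\mathbf{z}_0 \text{ is distributed as } N_m(\boldsymbol{\beta}'\mathbf{z}_0,\; \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\,\boldsymbol{\Sigma})$$

and

$$n\hat{\boldsymbol{\Sigma}} \text{ is independently distributed as } W_{n-r-1}(\boldsymbol{\Sigma})$$

The unknown value of the regression function at $\mathbf{z}_0$ is $\boldsymbol{\beta}'\mathbf{z}_0$. So, from the discussion
of the $T^2$-statistic in Section 5.2, we can write

$$T^2 = \left(\frac{\hat{\boldsymbol{\beta}}'\mathbf{z}_0 - \boldsymbol{\beta}'\mathbf{z}_0}{\sqrt{\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0}}\right)'\left(\frac{n}{n - r - 1}\hat{\boldsymbol{\Sigma}}\right)^{-1}\left(\frac{\hat{\boldsymbol{\beta}}'\mathbf{z}_0 - \boldsymbol{\beta}'\mathbf{z}_0}{\sqrt{\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0}}\right) \tag{7-39}$$

and the $100(1 - \alpha)\%$ confidence ellipsoid for $\boldsymbol{\beta}'\mathbf{z}_0$ is provided by the inequality

$$(\boldsymbol{\beta}'\mathbf{z}_0 - \hat{\boldsymbol{\beta}}'\mathbf{z}_0)'\left(\frac{n}{n - r - 1}\hat{\boldsymbol{\Sigma}}\right)^{-1}(\boldsymbol{\beta}'\mathbf{z}_0 - \hat{\boldsymbol{\beta}}'\mathbf{z}_0) \le \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\left[\left(\frac{m(n - r - 1)}{n - r - m}\right)F_{m,\,n-r-m}(\alpha)\right] \tag{7-40}$$

where $F_{m,\,n-r-m}(\alpha)$ is the upper $(100\alpha)$th percentile of an $F$-distribution with $m$ and
$n - r - m$ d.f.

The $100(1 - \alpha)\%$ simultaneous confidence intervals for $E(Y_i) = \mathbf{z}_0'\boldsymbol{\beta}_{(i)}$ are

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(i)} \pm \sqrt{\left(\frac{m(n - r - 1)}{n - r - m}\right)F_{m,\,n-r-m}(\alpha)}\;\sqrt{\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\left(\frac{n}{n - r - 1}\hat{\sigma}_{ii}\right)}, \qquad i = 1, 2, \ldots, m \tag{7-41}$$

where $\hat{\boldsymbol{\beta}}_{(i)}$ is the $i$th column of $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}_{ii}$ is the $i$th diagonal element of $\hat{\boldsymbol{\Sigma}}$.

The second prediction problem is concerned with forecasting new responses
$\mathbf{Y}_0 = \boldsymbol{\beta}'\mathbf{z}_0 + \boldsymbol{\varepsilon}_0$ at $\mathbf{z}_0$. Here $\boldsymbol{\varepsilon}_0$ is independent of $\boldsymbol{\varepsilon}$. Now,

$$\mathbf{Y}_0 - \hat{\boldsymbol{\beta}}'\mathbf{z}_0 = (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})'\mathbf{z}_0 + \boldsymbol{\varepsilon}_0 \text{ is distributed as } N_m(\mathbf{0},\; (1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)\boldsymbol{\Sigma})$$

independently of $n\hat{\boldsymbol{\Sigma}}$, so the $100(1 - \alpha)\%$ prediction ellipsoid for $\mathbf{Y}_0$ becomes

$$(\mathbf{Y}_0 - \hat{\boldsymbol{\beta}}'\mathbf{z}_0)'\left(\frac{n}{n - r - 1}\hat{\boldsymbol{\Sigma}}\right)^{-1}(\mathbf{Y}_0 - \hat{\boldsymbol{\beta}}'\mathbf{z}_0) \le (1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)\left[\left(\frac{m(n - r - 1)}{n - r - m}\right)F_{m,\,n-r-m}(\alpha)\right] \tag{7-42}$$

The $100(1 - \alpha)\%$ simultaneous prediction intervals for the individual responses $Y_{0i}$ are

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(i)} \pm \sqrt{\left(\frac{m(n - r - 1)}{n - r - m}\right)F_{m,\,n-r-m}(\alpha)}\;\sqrt{(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)\left(\frac{n}{n - r - 1}\hat{\sigma}_{ii}\right)}, \qquad i = 1, 2, \ldots, m \tag{7-43}$$

where $\hat{\boldsymbol{\beta}}_{(i)}$, $\hat{\sigma}_{ii}$, and $F_{m,\,n-r-m}(\alpha)$ are the same quantities appearing in (7-41).
Comparing (7-41) and (7-43), we see that the prediction intervals for the actual values of
the response variables are wider than the corresponding intervals for the expected
values. The extra width reflects the presence of the random error $\varepsilon_{0i}$.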


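Example 7.10 (Constructing a confidence ellipse and a prediction ellipse for bivariate
responses) A second response variable was measured for the computer-requirement
problem discussed in Example 7.6. Measurements on the response $Y_2$, disk
input/output capacity, corresponding to the $z_1$ and $z_2$ values in that example were

$$\mathbf{y}_2' = [301.8, 396.1, 328.2, 307.4, 362.4, 369.5, 229.1]$$

Obtain the 95% confidence ellipse for $\boldsymbol{\beta}'\mathbf{z}_0$ and the 95% prediction ellipse for
$\mathbf{Y}_0' = [Y_{01}, Y_{02}]$ for a site with the configuration $\mathbf{z}_0' = [1, 130, 7.5]$.
Computer calculations provide the fitted equation

$$\hat{y}_2 = 14.14 + 2.25z_1 + 5.67z_2$$

with $s = 1.812$. Thus, $\hat{\boldsymbol{\beta}}_{(2)}' = [14.14, 2.25, 5.67]$. From Example 7.6,

$$\hat{\boldsymbol{\beta}}_{(1)}' = [8.42, 1.08, .42], \quad \mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(1)} = 151.97, \quad\text{and}\quad \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0 = .34725$$

We find that

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(2)} = 14.14 + 2.25(130) + 5.67(7.5) = 349.17$$

and

$$n\hat{\boldsymbol{\Sigma}} = \begin{bmatrix} (\mathbf{y}_{(1)} - \mathbf{Z}\hat{\boldsymbol{\beta}}_{(1)})'(\mathbf{y}_{(1)} - \mathbf{Z}\hat{\boldsymbol{\beta}}_{(1)}) & (\mathbf{y}_{(1)} - \mathbf{Z}\hat{\boldsymbol{\beta}}_{(1)})'(\mathbf{y}_{(2)} - \mathbf{Z}\hat{\boldsymbol{\beta}}_{(2)}) \\ (\mathbf{y}_{(2)} - \mathbf{Z}\hat{\boldsymbol{\beta}}_{(2)})'(\mathbf{y}_{(1)} - \mathbf{Z}\hat{\boldsymbol{\beta}}_{(1)}) & (\mathbf{y}_{(2)} - \mathbf{Z}\hat{\boldsymbol{\beta}}_{(2)})'(\mathbf{y}_{(2)} - \mathbf{Z}\hat{\boldsymbol{\beta}}_{(2)}) \end{bmatrix} = \begin{bmatrix} 5.80 & 5.30 \\ 5.30 & 13.13 \end{bmatrix}$$

Since

$$\hat{\boldsymbol{\beta}}'\mathbf{z}_0 = \begin{bmatrix} \hat{\boldsymbol{\beta}}_{(1)}' \\ \hat{\boldsymbol{\beta}}_{(2)}' \end{bmatrix}\mathbf{z}_0 = \begin{bmatrix} \mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(1)} \\ \mathbf{z}_0'\hat{\boldsymbol{\beta}}_{(2)} \end{bmatrix} = \begin{bmatrix} 151.97 \\ 349.17 \end{bmatrix}$$

$n = 7$, $r = 2$, and $m = 2$, a 95% confidence ellipse for $\boldsymbol{\beta}'\mathbf{z}_0 = \begin{bmatrix} \mathbf{z}_0'\boldsymbol{\beta}_{(1)} \\ \mathbf{z}_0'\boldsymbol{\beta}_{(2)} \end{bmatrix}$ is, from
(7-40), the set

$$[\mathbf{z}_0'\boldsymbol{\beta}_{(1)} - 151.97,\; \mathbf{z}_0'\boldsymbol{\beta}_{(2)} - 349.17]\,(4)\begin{bmatrix} 5.80 & 5.30 \\ 5.30 & 13.13 \end{bmatrix}^{-1}\begin{bmatrix} \mathbf{z}_0'\boldsymbol{\beta}_{(1)} - 151.97 \\ \mathbf{z}_0'\boldsymbol{\beta}_{(2)} - 349.17 \end{bmatrix} \le (.34725)\left[\left(\frac{2(4)}{3}\right)F_{2,3}(.05)\right]$$

with $F_{2,3}(.05) = 9.55$. This ellipse is centered at $(151.97, 349.17)$. Its orientation and
the lengths of the major and minor axes can be determined from the eigenvalues
and eigenvectors of $n\hat{\boldsymbol{\Sigma}}$.

Comparing (7-40) and (7-42), we see that the only change required for the
calculation of the 95% prediction ellipse is to replace $\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0 = .34725$ with
$1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0 = 1.34725$. Thus, the 95% prediction ellipse for $\mathbf{Y}_0' = [Y_{01}, Y_{02}]$ is
also centered at $(151.97, 349.17)$, but is larger than the confidence ellipse. Both
ellipses are sketched in Figure 7.5.

Figure 7.5 95% confidence and prediction ellipses for the computer data with two responses. (Horizontal axis: Response 1; vertical axis: Response 2.)

It is the prediction ellipse that is relevant to the determination of computer
requirements for a particular site with the given $\mathbf{z}_0$. ■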
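For the same site, the simultaneous intervals (7-41) and (7-43) for the individual responses can be evaluated from the quantities quoted in Example 7.10. The sketch below uses only the Python standard library; the critical value 9.55 is taken from the example rather than computed, and `half_width` is a hypothetical helper name.

```python
import math

# Quantities quoted in Example 7.10
n, r, m = 7, 2, 2
f_crit = 9.55                       # F_{2,3}(.05), as given in the example
leverage = 0.34725                  # z0'(Z'Z)^{-1} z0
centers = [151.97, 349.17]          # z0' beta_hat_(i), i = 1, 2
n_sigma_diag = [5.80, 13.13]        # diagonal entries of n * Sigma_hat

scale = m * (n - r - 1) / (n - r - m) * f_crit

def half_width(n_sigma_ii, factor):
    """Half-width from (7-41) (factor = leverage) or (7-43) (factor = 1 + leverage)."""
    return math.sqrt(scale) * math.sqrt(factor * n_sigma_ii / (n - r - 1))

for center, s_ii in zip(centers, n_sigma_diag):
    ci = half_width(s_ii, leverage)
    pi = half_width(s_ii, 1.0 + leverage)
    print(f"mean response: {center - ci:.2f} to {center + ci:.2f}; "
          f"new response: {center - pi:.2f} to {center + pi:.2f}")
```

Because the only change between the two formulas is replacing the leverage by one plus the leverage, each prediction interval is wider than the corresponding confidence interval by the factor $\sqrt{1.34725/.34725}$.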


7.8 The Concept of Linear Regression 


The classical linear regression model is concerned with the association between a 
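single dependent variable $Y$ and a collection of predictor variables $z_1, z_2, \ldots, z_r$. The
regression model that we have considered treats $Y$ as a random variable whose
mean depends upon fixed values of the $z_i$'s. This mean is assumed to be a linear
function of the regression coefficients $\beta_0, \beta_1, \ldots, \beta_r$.

The linear regression model also arises in a different setting. Suppose all the
variables $Y, Z_1, Z_2, \ldots, Z_r$ are random and have a joint distribution, not necessarily
normal, with mean vector $\underset{((r+1)\times 1)}{\boldsymbol{\mu}}$ and covariance matrix $\underset{((r+1)\times(r+1))}{\boldsymbol{\Sigma}}$. Partitioning $\boldsymbol{\mu}$
and $\boldsymbol{\Sigma}$ in an obvious fashion, we write

$$\boldsymbol{\mu} = \begin{bmatrix} \underset{(1\times 1)}{\mu_Y} \\ \underset{(r\times 1)}{\boldsymbol{\mu}_Z} \end{bmatrix} \quad\text{and}\quad \boldsymbol{\Sigma} = \begin{bmatrix} \underset{(1\times 1)}{\sigma_{YY}} & \underset{(1\times r)}{\boldsymbol{\sigma}_{ZY}'} \\ \underset{(r\times 1)}{\boldsymbol{\sigma}_{ZY}} & \underset{(r\times r)}{\boldsymbol{\Sigma}_{ZZ}} \end{bmatrix}$$

with

$$\boldsymbol{\sigma}_{ZY}' = [\sigma_{YZ_1}, \sigma_{YZ_2}, \ldots, \sigma_{YZ_r}] \tag{7-44}$$

$\boldsymbol{\Sigma}_{ZZ}$ can be taken to have full rank.⁶ Consider the problem of predicting $Y$ using the

$$\text{linear predictor} = b_0 + b_1Z_1 + \cdots + b_rZ_r = b_0 + \mathbf{b}'\mathbf{Z} \tag{7-45}$$

⁶If $\boldsymbol{\Sigma}_{ZZ}$ is not of full rank, one variable (for example, $Z_k$) can be written as a linear combination of
the other $Z_i$'s and thus is redundant in forming the linear regression function $\mathbf{Z}'\boldsymbol{\beta}$. That is, $\mathbf{Z}$ may be
replaced by any subset of components whose nonsingular covariance matrix has the same rank as $\boldsymbol{\Sigma}_{ZZ}$.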



For a given predictor of the form of (7-45), the error in the prediction of $Y$ is

$$\text{prediction error} = Y - b_0 - b_1Z_1 - \cdots - b_rZ_r = Y - b_0 - \mathbf{b}'\mathbf{Z} \tag{7-46}$$

Because this error is random, it is customary to select $b_0$ and $\mathbf{b}$ to minimize the

$$\text{mean square error} = E(Y - b_0 - \mathbf{b}'\mathbf{Z})^2 \tag{7-47}$$

Now the mean square error depends on the joint distribution of $Y$ and $\mathbf{Z}$ only
through the parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. It is possible to express the “optimal” linear
predictor in terms of these latter quantities.
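Result 7.12. The linear predictor $\beta_0 + \boldsymbol{\beta}'\mathbf{Z}$ with coefficients

$$\boldsymbol{\beta} = \boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}, \qquad \beta_0 = \mu_Y - \boldsymbol{\beta}'\boldsymbol{\mu}_Z$$

has minimum mean square error among all linear predictors of the response $Y$. Its mean
square error is

$$E(Y - \beta_0 - \boldsymbol{\beta}'\mathbf{Z})^2 = E(Y - \mu_Y - \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{Z} - \boldsymbol{\mu}_Z))^2 = \sigma_{YY} - \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}$$

Also, $\beta_0 + \boldsymbol{\beta}'\mathbf{Z} = \mu_Y + \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{Z} - \boldsymbol{\mu}_Z)$ is the linear predictor having maximum
correlation with $Y$; that is,

$$\text{Corr}(Y, \beta_0 + \boldsymbol{\beta}'\mathbf{Z}) = \max_{b_0,\mathbf{b}}\text{Corr}(Y, b_0 + \mathbf{b}'\mathbf{Z}) = \sqrt{\frac{\boldsymbol{\beta}'\boldsymbol{\Sigma}_{ZZ}\boldsymbol{\beta}}{\sigma_{YY}}} = \sqrt{\frac{\boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}}{\sigma_{YY}}}$$

Proof. Writing $b_0 + \mathbf{b}'\mathbf{Z} = b_0 + \mathbf{b}'\mathbf{Z} + (\mu_Y - \mathbf{b}'\boldsymbol{\mu}_Z) - (\mu_Y - \mathbf{b}'\boldsymbol{\mu}_Z)$, we get

$$\begin{aligned}
E(Y - b_0 - \mathbf{b}'\mathbf{Z})^2 &= E[Y - \mu_Y - (\mathbf{b}'\mathbf{Z} - \mathbf{b}'\boldsymbol{\mu}_Z) + (\mu_Y - b_0 - \mathbf{b}'\boldsymbol{\mu}_Z)]^2 \\
&= E(Y - \mu_Y)^2 + E(\mathbf{b}'(\mathbf{Z} - \boldsymbol{\mu}_Z))^2 + (\mu_Y - b_0 - \mathbf{b}'\boldsymbol{\mu}_Z)^2 - 2E[\mathbf{b}'(\mathbf{Z} - \boldsymbol{\mu}_Z)(Y - \mu_Y)] \\
&= \sigma_{YY} + \mathbf{b}'\boldsymbol{\Sigma}_{ZZ}\mathbf{b} + (\mu_Y - b_0 - \mathbf{b}'\boldsymbol{\mu}_Z)^2 - 2\mathbf{b}'\boldsymbol{\sigma}_{ZY}
\end{aligned}$$

Adding and subtracting $\boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}$, we obtain

$$E(Y - b_0 - \mathbf{b}'\mathbf{Z})^2 = \sigma_{YY} - \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY} + (\mu_Y - b_0 - \mathbf{b}'\boldsymbol{\mu}_Z)^2 + (\mathbf{b} - \boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY})'\boldsymbol{\Sigma}_{ZZ}(\mathbf{b} - \boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY})$$

The mean square error is minimized by taking $\mathbf{b} = \boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY} = \boldsymbol{\beta}$, making the last
term zero, and then choosing $b_0 = \mu_Y - (\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY})'\boldsymbol{\mu}_Z = \beta_0$ to make the third
term zero. The minimum mean square error is thus $\sigma_{YY} - \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}$.

Next, we note that $\text{Cov}(b_0 + \mathbf{b}'\mathbf{Z}, Y) = \text{Cov}(\mathbf{b}'\mathbf{Z}, Y) = \mathbf{b}'\boldsymbol{\sigma}_{ZY}$ so

$$[\text{Corr}(b_0 + \mathbf{b}'\mathbf{Z}, Y)]^2 = \frac{[\mathbf{b}'\boldsymbol{\sigma}_{ZY}]^2}{\sigma_{YY}(\mathbf{b}'\boldsymbol{\Sigma}_{ZZ}\mathbf{b})}, \quad\text{for all } b_0, \mathbf{b}$$

Employing the extended Cauchy-Schwarz inequality of (2-49) with $\mathbf{B} = \boldsymbol{\Sigma}_{ZZ}$, we
obtain

$$(\mathbf{b}'\boldsymbol{\sigma}_{ZY})^2 \le \mathbf{b}'\boldsymbol{\Sigma}_{ZZ}\mathbf{b}\;\boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}$$

or

$$[\text{Corr}(b_0 + \mathbf{b}'\mathbf{Z}, Y)]^2 \le \frac{\boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}}{\sigma_{YY}}$$

with equality for $\mathbf{b} = \boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY} = \boldsymbol{\beta}$. The alternative expression for the maximum
correlation follows from the equation $\boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY} = \boldsymbol{\sigma}_{ZY}'\boldsymbol{\beta} = \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\Sigma}_{ZZ}\boldsymbol{\beta} = \boldsymbol{\beta}'\boldsymbol{\Sigma}_{ZZ}\boldsymbol{\beta}$. ■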


The correlation between Y and its best linear predictor is called the population 
multiple correlation coefficient 


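$$\rho_{Y(Z)} = +\sqrt{\frac{\boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}}{\sigma_{YY}}} \tag{7-48}$$

The square of the population multiple correlation coefficient, $\rho_{Y(Z)}^2$, is called the
population coefficient of determination. Note that, unlike other correlation
coefficients, the multiple correlation coefficient is a positive square root, so $0 \le \rho_{Y(Z)} \le 1$.

The population coefficient of determination has an important interpretation.
From Result 7.12, the mean square error in using $\beta_0 + \boldsymbol{\beta}'\mathbf{Z}$ to forecast $Y$ is

$$\sigma_{YY} - \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY} = \sigma_{YY} - \sigma_{YY}\left(\frac{\boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}}{\sigma_{YY}}\right) = \sigma_{YY}(1 - \rho_{Y(Z)}^2) \tag{7-49}$$

If $\rho_{Y(Z)}^2 = 0$, there is no predictive power in $\mathbf{Z}$. At the other extreme, $\rho_{Y(Z)}^2 = 1$
implies that $Y$ can be predicted with no error.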


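Example 7.11 (Determining the best linear predictor, its mean square error, and the
multiple correlation coefficient) Given the mean vector and covariance matrix of $Y$,
$Z_1$, $Z_2$,

$$\boldsymbol{\mu} = \begin{bmatrix} \mu_Y \\ \boldsymbol{\mu}_Z \end{bmatrix} = \begin{bmatrix} 5 \\ 2 \\ 0 \end{bmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{YY} & \boldsymbol{\sigma}_{ZY}' \\ \boldsymbol{\sigma}_{ZY} & \boldsymbol{\Sigma}_{ZZ} \end{bmatrix} = \begin{bmatrix} 10 & 1 & -1 \\ 1 & 7 & 3 \\ -1 & 3 & 2 \end{bmatrix}$$

determine (a) the best linear predictor $\beta_0 + \beta_1 Z_1 + \beta_2 Z_2$, (b) its mean square
error, and (c) the multiple correlation coefficient. Also, verify that the mean square
error equals $\sigma_{YY}(1 - \rho_{Y(Z)}^2)$.

First,

$$\boldsymbol{\beta} = \boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY} = \begin{bmatrix} 7 & 3 \\ 3 & 2 \end{bmatrix}^{-1}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} .4 & -.6 \\ -.6 & 1.4 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 1 \\ -2 \end{bmatrix}$$

$$\beta_0 = \mu_Y - \boldsymbol{\beta}'\boldsymbol{\mu}_Z = 5 - [1, -2]\begin{bmatrix} 2 \\ 0 \end{bmatrix} = 3$$

so the best linear predictor is $\beta_0 + \boldsymbol{\beta}'\mathbf{Z} = 3 + Z_1 - 2Z_2$. The mean square error is

$$\sigma_{YY} - \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY} = 10 - [1, -1]\begin{bmatrix} .4 & -.6 \\ -.6 & 1.4 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = 10 - 3 = 7$$

and the multiple correlation coefficient is

$$\rho_{Y(Z)} = \sqrt{\frac{\boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}}{\sigma_{YY}}} = \sqrt{\frac{3}{10}} = .548$$

Note that $\sigma_{YY}(1 - \rho_{Y(Z)}^2) = 10(1 - \tfrac{3}{10}) = 7$ is the mean square error. ■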
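The arithmetic in Example 7.11 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `best_linear_predictor` and the ordering of the variables as (Y, Z1, Z2) are choices made here for illustration, not part of the text.

```python
import numpy as np

# Population moments from Example 7.11, ordered as (Y, Z1, Z2).
mu = np.array([5.0, 2.0, 0.0])
Sigma = np.array([[10.0, 1.0, -1.0],
                  [1.0, 7.0, 3.0],
                  [-1.0, 3.0, 2.0]])

def best_linear_predictor(mu, Sigma):
    """Return (beta0, beta, mse, rho) for predicting the first variable from the rest."""
    mu_y, mu_z = mu[0], mu[1:]
    sigma_yy = Sigma[0, 0]
    sigma_zy = Sigma[1:, 0]
    Sigma_zz = Sigma[1:, 1:]
    beta = np.linalg.solve(Sigma_zz, sigma_zy)   # Sigma_ZZ^{-1} sigma_ZY
    beta0 = mu_y - beta @ mu_z
    explained = sigma_zy @ beta                  # sigma_ZY' Sigma_ZZ^{-1} sigma_ZY
    mse = sigma_yy - explained                   # minimum mean square error
    rho = np.sqrt(explained / sigma_yy)          # multiple correlation, as in (7-48)
    return beta0, beta, mse, rho

beta0, beta, mse, rho = best_linear_predictor(mu, Sigma)
print(beta0, beta, mse, round(rho, 3))
```

Solving the linear system rather than forming the inverse explicitly is a numerical convenience only; the quantities computed are those of Result 7.12.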


It is possible to show (see Exercise 7.5) that

$$1 - \rho_{Y(Z)}^2 = \frac{1}{\rho^{YY}} \tag{7-50}$$

where $\rho^{YY}$ is the upper-left-hand corner of the inverse of the correlation matrix
determined from $\boldsymbol{\Sigma}$.
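Identity (7-50) can be illustrated with the covariance matrix of Example 7.11. This is a sketch under the assumption that NumPy is available; both helper functions are hypothetical names introduced here.

```python
import numpy as np

# Covariance matrix from Example 7.11, ordered as (Y, Z1, Z2).
Sigma = np.array([[10.0, 1.0, -1.0],
                  [1.0, 7.0, 3.0],
                  [-1.0, 3.0, 2.0]])

def one_minus_rho_sq_direct(Sigma):
    """1 - rho^2 computed from its definition in (7-48)."""
    sigma_zy = Sigma[1:, 0]
    explained = sigma_zy @ np.linalg.solve(Sigma[1:, 1:], sigma_zy)
    return 1.0 - explained / Sigma[0, 0]

def one_minus_rho_sq_from_corr(Sigma):
    """1 / rho^{YY}, the reciprocal of the (1,1) entry of the inverse correlation matrix."""
    d = 1.0 / np.sqrt(np.diag(Sigma))
    corr = Sigma * np.outer(d, d)                # V^{-1/2} Sigma V^{-1/2}
    return 1.0 / np.linalg.inv(corr)[0, 0]

lhs = one_minus_rho_sq_direct(Sigma)
rhs = one_minus_rho_sq_from_corr(Sigma)
print(lhs, rhs)
```

Both routes give 1 - 3/10 for this matrix, in agreement with Example 7.11.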

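The restriction to linear predictors is closely connected to the assumption of
normality. Specifically, if we take

$$\begin{bmatrix} Y \\ Z_1 \\ \vdots \\ Z_r \end{bmatrix} \text{ to be distributed as } N_{r+1}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

then the conditional distribution of $Y$ with $z_1, z_2, \ldots, z_r$ fixed (see Result 4.6) is

$$N\left(\mu_Y + \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{z} - \boldsymbol{\mu}_Z),\; \sigma_{YY} - \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}\right)$$

The mean of this conditional distribution is the linear predictor in Result 7.12.
That is,

$$E(Y \mid z_1, z_2, \ldots, z_r) = \mu_Y + \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{z} - \boldsymbol{\mu}_Z) = \beta_0 + \boldsymbol{\beta}'\mathbf{z} \tag{7-51}$$

and we conclude that $E(Y \mid z_1, z_2, \ldots, z_r)$ is the best linear predictor of $Y$ when the
population is $N_{r+1}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. The conditional expectation of $Y$ in (7-51) is called the
regression function. For normal populations, it is linear.
When the population is not normal, the regression function $E(Y \mid z_1, z_2, \ldots, z_r)$
need not be of the form $\beta_0 + \boldsymbol{\beta}'\mathbf{z}$. Nevertheless, it can be shown (see [22]) that
$E(Y \mid z_1, z_2, \ldots, z_r)$, whatever its form, predicts $Y$ with the smallest mean square
error. Fortunately, this wider optimality among all estimators is possessed by the
linear predictor when the population is normal.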


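Result 7.13. Suppose the joint distribution of $Y$ and $\mathbf{Z}$ is $N_{r+1}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Let

$$\hat{\boldsymbol{\mu}} = \begin{bmatrix} \bar{Y} \\ \bar{\mathbf{Z}} \end{bmatrix} \quad \text{and} \quad \mathbf{S} = \begin{bmatrix} s_{YY} & \mathbf{s}_{ZY}' \\ \mathbf{s}_{ZY} & \mathbf{S}_{ZZ} \end{bmatrix}$$

be the sample mean vector and sample covariance matrix, respectively, for a random
sample of size $n$ from this population. Then the maximum likelihood estimators of
the coefficients in the linear predictor are

$$\hat{\boldsymbol{\beta}} = \mathbf{S}_{ZZ}^{-1}\mathbf{s}_{ZY}, \qquad \hat{\beta}_0 = \bar{Y} - \mathbf{s}_{ZY}'\mathbf{S}_{ZZ}^{-1}\bar{\mathbf{Z}} = \bar{Y} - \hat{\boldsymbol{\beta}}'\bar{\mathbf{Z}}$$

Consequently, the maximum likelihood estimator of the linear regression function is

$$\hat{\beta}_0 + \hat{\boldsymbol{\beta}}'\mathbf{z} = \bar{Y} + \mathbf{s}_{ZY}'\mathbf{S}_{ZZ}^{-1}(\mathbf{z} - \bar{\mathbf{Z}})$$

and the maximum likelihood estimator of the mean square error $E[Y - \beta_0 - \boldsymbol{\beta}'\mathbf{Z}]^2$ is

$$\hat{\sigma}_{YY\cdot Z} = \frac{n-1}{n}\left(s_{YY} - \mathbf{s}_{ZY}'\mathbf{S}_{ZZ}^{-1}\mathbf{s}_{ZY}\right)$$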


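Proof. We use Result 4.11 and the invariance property of maximum likelihood estimators.
[See (4-20).] Since, from Result 7.12,

$$\beta_0 = \mu_Y - (\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY})'\boldsymbol{\mu}_Z, \qquad \boldsymbol{\beta} = \boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}, \qquad \beta_0 + \boldsymbol{\beta}'\mathbf{z} = \mu_Y + \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{z} - \boldsymbol{\mu}_Z)$$

and

$$\text{mean square error} = \sigma_{YY\cdot Z} = \sigma_{YY} - \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}$$

the conclusions follow upon substitution of the maximum likelihood estimators

$$\hat{\boldsymbol{\mu}} = \begin{bmatrix} \bar{Y} \\ \bar{\mathbf{Z}} \end{bmatrix} \quad \text{and} \quad \hat{\boldsymbol{\Sigma}} = \begin{bmatrix} \hat{\sigma}_{YY} & \hat{\boldsymbol{\sigma}}_{ZY}' \\ \hat{\boldsymbol{\sigma}}_{ZY} & \hat{\boldsymbol{\Sigma}}_{ZZ} \end{bmatrix} = \left(\frac{n-1}{n}\right)\mathbf{S}$$

for

$$\boldsymbol{\mu} = \begin{bmatrix} \mu_Y \\ \boldsymbol{\mu}_Z \end{bmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{YY} & \boldsymbol{\sigma}_{ZY}' \\ \boldsymbol{\sigma}_{ZY} & \boldsymbol{\Sigma}_{ZZ} \end{bmatrix}$$ ■

It is customary to change the divisor from $n$ to $n - (r + 1)$ in the estimator of the
mean square error, $\sigma_{YY\cdot Z} = E(Y - \beta_0 - \boldsymbol{\beta}'\mathbf{Z})^2$, in order to obtain the unbiased
estimator

$$\left(\frac{n-1}{n-r-1}\right)\left(s_{YY} - \mathbf{s}_{ZY}'\mathbf{S}_{ZZ}^{-1}\mathbf{s}_{ZY}\right) = \frac{\sum_{j=1}^{n}(Y_j - \hat{\beta}_0 - \hat{\boldsymbol{\beta}}'\mathbf{Z}_j)^2}{n-r-1} \tag{7-52}$$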


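Example 7.12 (Maximum likelihood estimate of the regression function: single
response) For the computer data of Example 7.6, the $n = 7$ observations on $Y$
(CPU time), $Z_1$ (orders), and $Z_2$ (add-delete items) give the sample mean vector
and sample covariance matrix:

$$\hat{\boldsymbol{\mu}} = \begin{bmatrix} \bar{y} \\ \bar{\mathbf{z}} \end{bmatrix} = \begin{bmatrix} 150.44 \\ 130.24 \\ 3.547 \end{bmatrix}$$

$$\mathbf{S} = \begin{bmatrix} s_{YY} & \mathbf{s}_{ZY}' \\ \mathbf{s}_{ZY} & \mathbf{S}_{ZZ} \end{bmatrix} = \begin{bmatrix} 467.913 & 418.763 & 35.983 \\ 418.763 & 377.200 & 28.034 \\ 35.983 & 28.034 & 13.657 \end{bmatrix}$$

Assuming that $Y$, $Z_1$, and $Z_2$ are jointly normal, obtain the estimated regression
function and the estimated mean square error.
Result 7.13 gives the maximum likelihood estimates

$$\hat{\boldsymbol{\beta}} = \mathbf{S}_{ZZ}^{-1}\mathbf{s}_{ZY} = \begin{bmatrix} .003128 & -.006422 \\ -.006422 & .086404 \end{bmatrix}\begin{bmatrix} 418.763 \\ 35.983 \end{bmatrix} = \begin{bmatrix} 1.079 \\ .420 \end{bmatrix}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\boldsymbol{\beta}}'\bar{\mathbf{z}} = 150.44 - [1.079, .420]\begin{bmatrix} 130.24 \\ 3.547 \end{bmatrix} = 150.44 - 142.019 = 8.421$$

and the estimated regression function

$$\hat{\beta}_0 + \hat{\boldsymbol{\beta}}'\mathbf{z} = 8.42 + 1.08z_1 + .42z_2$$

The maximum likelihood estimate of the mean square error arising from the
prediction of $Y$ with this regression function is

$$\left(\frac{n-1}{n}\right)\left(s_{YY} - \mathbf{s}_{ZY}'\mathbf{S}_{ZZ}^{-1}\mathbf{s}_{ZY}\right) = \left(\frac{6}{7}\right)\left(467.913 - [418.763, 35.983]\begin{bmatrix} .003128 & -.006422 \\ -.006422 & .086404 \end{bmatrix}\begin{bmatrix} 418.763 \\ 35.983 \end{bmatrix}\right) = .894$$ ■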
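The estimates in Example 7.12 follow directly from the sample mean vector and covariance matrix. The sketch below assumes NumPy and uses the rounded entries printed in the example; the variable names are illustrative. Because the residual variance is a small difference between two large numbers, the value computed from these rounded inputs can differ noticeably from the .894 reported in the example, so only the coefficients should be expected to match closely.

```python
import numpy as np

n, r = 7, 2
mean = np.array([150.44, 130.24, 3.547])          # (ybar, z1bar, z2bar)
S = np.array([[467.913, 418.763, 35.983],
              [418.763, 377.200, 28.034],
              [35.983, 28.034, 13.657]])

s_zy = S[1:, 0]
S_zz = S[1:, 1:]

beta_hat = np.linalg.solve(S_zz, s_zy)            # S_ZZ^{-1} s_ZY
beta0_hat = mean[0] - beta_hat @ mean[1:]         # ybar - beta_hat' zbar

resid_var = S[0, 0] - s_zy @ beta_hat             # s_YY - s_ZY' S_ZZ^{-1} s_ZY
mse_ml = (n - 1) / n * resid_var                  # divisor n, as in Result 7.13
mse_unbiased = (n - 1) / (n - r - 1) * resid_var  # divisor n - r - 1, as in (7-52)

print(np.round(beta_hat, 3), round(beta0_hat, 2))
```

The last two lines contrast the maximum likelihood divisor with the unbiased divisor of (7-52); the point estimates of the coefficients are unaffected by that choice.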


Prediction of Several Variables 


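The extension of the previous results to the prediction of several responses $Y_1$,
$Y_2, \ldots, Y_m$ is almost immediate. We present this extension for normal populations.

Suppose

$$\begin{bmatrix} \underset{(m\times 1)}{\mathbf{Y}} \\ \underset{(r\times 1)}{\mathbf{Z}} \end{bmatrix} \text{ is distributed as } N_{m+r}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

with

$$\boldsymbol{\mu} = \begin{bmatrix} \underset{(m\times 1)}{\boldsymbol{\mu}_Y} \\ \underset{(r\times 1)}{\boldsymbol{\mu}_Z} \end{bmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{bmatrix} \underset{(m\times m)}{\boldsymbol{\Sigma}_{YY}} & \underset{(m\times r)}{\boldsymbol{\Sigma}_{YZ}} \\ \underset{(r\times m)}{\boldsymbol{\Sigma}_{ZY}} & \underset{(r\times r)}{\boldsymbol{\Sigma}_{ZZ}} \end{bmatrix}$$

By Result 4.6, the conditional expectation of $[Y_1, Y_2, \ldots, Y_m]'$, given the fixed values
$z_1, z_2, \ldots, z_r$ of the predictor variables, is

$$E[\mathbf{Y} \mid z_1, z_2, \ldots, z_r] = \boldsymbol{\mu}_Y + \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{z} - \boldsymbol{\mu}_Z) \tag{7-53}$$

This conditional expected value, considered as a function of $z_1, z_2, \ldots, z_r$, is called
the multivariate regression of the vector $\mathbf{Y}$ on $\mathbf{Z}$. It is composed of $m$ univariate
regressions. For instance, the first component of the conditional mean vector is
$\mu_{Y_1} + \boldsymbol{\Sigma}_{Y_1Z}\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{z} - \boldsymbol{\mu}_Z) = E(Y_1 \mid z_1, z_2, \ldots, z_r)$, which minimizes the mean square
error for the prediction of $Y_1$. The $m \times r$ matrix $\boldsymbol{\beta} = \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}$ is called the matrix
of regression coefficients.

The error of prediction vector

$$\mathbf{Y} - \boldsymbol{\mu}_Y - \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{Z} - \boldsymbol{\mu}_Z)$$

has the expected squares and cross-products matrix

$$\begin{aligned} \boldsymbol{\Sigma}_{YY\cdot Z} &= E[\mathbf{Y} - \boldsymbol{\mu}_Y - \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{Z} - \boldsymbol{\mu}_Z)][\mathbf{Y} - \boldsymbol{\mu}_Y - \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{Z} - \boldsymbol{\mu}_Z)]' \\ &= \boldsymbol{\Sigma}_{YY} - \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}(\boldsymbol{\Sigma}_{YZ})' - \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\Sigma}_{ZY} + \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\Sigma}_{ZZ}\boldsymbol{\Sigma}_{ZZ}^{-1}(\boldsymbol{\Sigma}_{YZ})' \\ &= \boldsymbol{\Sigma}_{YY} - \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\Sigma}_{ZY} \end{aligned} \tag{7-54}$$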


Because w and & are typically unknown, they must be estimated from a random 
sample in order to construct the multivariate linear predictor and determine expect- 
ed prediction errors. 


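Result 7.14. Suppose $\mathbf{Y}$ and $\mathbf{Z}$ are jointly distributed as $N_{m+r}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Then the
regression of the vector $\mathbf{Y}$ on $\mathbf{Z}$ is

$$\boldsymbol{\beta}_0 + \boldsymbol{\beta}\mathbf{z} = \boldsymbol{\mu}_Y - \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\mu}_Z + \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}\mathbf{z} = \boldsymbol{\mu}_Y + \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{z} - \boldsymbol{\mu}_Z)$$

The expected squares and cross-products matrix for the errors is

$$E(\mathbf{Y} - \boldsymbol{\beta}_0 - \boldsymbol{\beta}\mathbf{Z})(\mathbf{Y} - \boldsymbol{\beta}_0 - \boldsymbol{\beta}\mathbf{Z})' = \boldsymbol{\Sigma}_{YY\cdot Z} = \boldsymbol{\Sigma}_{YY} - \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\Sigma}_{ZY}$$

Based on a random sample of size $n$, the maximum likelihood estimator of the
regression function is

$$\hat{\boldsymbol{\beta}}_0 + \hat{\boldsymbol{\beta}}\mathbf{z} = \bar{\mathbf{Y}} + \mathbf{S}_{YZ}\mathbf{S}_{ZZ}^{-1}(\mathbf{z} - \bar{\mathbf{Z}})$$

and the maximum likelihood estimator of $\boldsymbol{\Sigma}_{YY\cdot Z}$ is

$$\hat{\boldsymbol{\Sigma}}_{YY\cdot Z} = \left(\frac{n-1}{n}\right)\left(\mathbf{S}_{YY} - \mathbf{S}_{YZ}\mathbf{S}_{ZZ}^{-1}\mathbf{S}_{ZY}\right)$$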
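Proof. The regression function and the covariance matrix for the prediction errors
follow from Result 4.6. Using the relationships

$$\boldsymbol{\beta}_0 = \boldsymbol{\mu}_Y - \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\mu}_Z, \qquad \boldsymbol{\beta} = \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}$$

$$\boldsymbol{\beta}_0 + \boldsymbol{\beta}\mathbf{z} = \boldsymbol{\mu}_Y + \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{z} - \boldsymbol{\mu}_Z)$$

$$\boldsymbol{\Sigma}_{YY\cdot Z} = \boldsymbol{\Sigma}_{YY} - \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\Sigma}_{ZY} = \boldsymbol{\Sigma}_{YY} - \boldsymbol{\beta}\boldsymbol{\Sigma}_{ZZ}\boldsymbol{\beta}'$$

we deduce the maximum likelihood statements from the invariance property [see
(4-20)] of maximum likelihood estimators upon substitution of

$$\hat{\boldsymbol{\mu}} = \begin{bmatrix} \bar{\mathbf{Y}} \\ \bar{\mathbf{Z}} \end{bmatrix}; \qquad \hat{\boldsymbol{\Sigma}} = \begin{bmatrix} \hat{\boldsymbol{\Sigma}}_{YY} & \hat{\boldsymbol{\Sigma}}_{YZ} \\ \hat{\boldsymbol{\Sigma}}_{ZY} & \hat{\boldsymbol{\Sigma}}_{ZZ} \end{bmatrix} = \left(\frac{n-1}{n}\right)\mathbf{S} = \left(\frac{n-1}{n}\right)\begin{bmatrix} \mathbf{S}_{YY} & \mathbf{S}_{YZ} \\ \mathbf{S}_{ZY} & \mathbf{S}_{ZZ} \end{bmatrix}$$ ■

It can be shown that an unbiased estimator of $\boldsymbol{\Sigma}_{YY\cdot Z}$ is

$$\left(\frac{n-1}{n-r-1}\right)\left(\mathbf{S}_{YY} - \mathbf{S}_{YZ}\mathbf{S}_{ZZ}^{-1}\mathbf{S}_{ZY}\right) = \frac{1}{n-r-1}\sum_{j=1}^{n}(\mathbf{Y}_j - \hat{\boldsymbol{\beta}}_0 - \hat{\boldsymbol{\beta}}\mathbf{Z}_j)(\mathbf{Y}_j - \hat{\boldsymbol{\beta}}_0 - \hat{\boldsymbol{\beta}}\mathbf{Z}_j)' \tag{7-55}$$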




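Example 7.13 (Maximum likelihood estimates of the regression functions: two
responses) We return to the computer data given in Examples 7.6 and 7.10. For
$Y_1$ = CPU time, $Y_2$ = disk I/O capacity, $Z_1$ = orders, and $Z_2$ = add-delete items,
we have

$$\hat{\boldsymbol{\mu}} = \begin{bmatrix} \bar{\mathbf{y}} \\ \bar{\mathbf{z}} \end{bmatrix} = \begin{bmatrix} 150.44 \\ 327.79 \\ 130.24 \\ 3.547 \end{bmatrix}$$

and

$$\mathbf{S} = \begin{bmatrix} \mathbf{S}_{YY} & \mathbf{S}_{YZ} \\ \mathbf{S}_{ZY} & \mathbf{S}_{ZZ} \end{bmatrix} = \begin{bmatrix} 467.913 & 1148.556 & 418.763 & 35.983 \\ 1148.556 & 3072.491 & 1008.976 & 140.558 \\ 418.763 & 1008.976 & 377.200 & 28.034 \\ 35.983 & 140.558 & 28.034 & 13.657 \end{bmatrix}$$

Assuming normality, we find that the estimated regression function is

$$\begin{aligned} \hat{\boldsymbol{\beta}}_0 + \hat{\boldsymbol{\beta}}\mathbf{z} &= \bar{\mathbf{y}} + \mathbf{S}_{YZ}\mathbf{S}_{ZZ}^{-1}(\mathbf{z} - \bar{\mathbf{z}}) \\ &= \begin{bmatrix} 150.44 \\ 327.79 \end{bmatrix} + \begin{bmatrix} 418.763 & 35.983 \\ 1008.976 & 140.558 \end{bmatrix}\begin{bmatrix} .003128 & -.006422 \\ -.006422 & .086404 \end{bmatrix}\begin{bmatrix} z_1 - 130.24 \\ z_2 - 3.547 \end{bmatrix} \\ &= \begin{bmatrix} 150.44 + 1.079(z_1 - 130.24) + .420(z_2 - 3.547) \\ 327.79 + 2.254(z_1 - 130.24) + 5.665(z_2 - 3.547) \end{bmatrix} \end{aligned}$$

Thus, the minimum mean square error predictor of $Y_1$ is

$$150.44 + 1.079(z_1 - 130.24) + .420(z_2 - 3.547) = 8.42 + 1.08z_1 + .42z_2$$

Similarly, the best predictor of $Y_2$ is

$$14.14 + 2.25z_1 + 5.67z_2$$

The maximum likelihood estimate of the expected squared errors and cross-products
matrix $\boldsymbol{\Sigma}_{YY\cdot Z}$ is given by

$$\begin{aligned} \left(\frac{n-1}{n}\right)\left(\mathbf{S}_{YY} - \mathbf{S}_{YZ}\mathbf{S}_{ZZ}^{-1}\mathbf{S}_{ZY}\right) &= \left(\frac{6}{7}\right)\left(\begin{bmatrix} 467.913 & 1148.556 \\ 1148.556 & 3072.491 \end{bmatrix} - \begin{bmatrix} 418.763 & 35.983 \\ 1008.976 & 140.558 \end{bmatrix}\begin{bmatrix} .003128 & -.006422 \\ -.006422 & .086404 \end{bmatrix}\begin{bmatrix} 418.763 & 1008.976 \\ 35.983 & 140.558 \end{bmatrix}\right) \\ &= \left(\frac{6}{7}\right)\begin{bmatrix} 1.043 & 1.042 \\ 1.042 & 2.572 \end{bmatrix} = \begin{bmatrix} .894 & .893 \\ .893 & 2.205 \end{bmatrix} \end{aligned}$$

The first estimated regression function, $8.42 + 1.08z_1 + .42z_2$, and the associated
mean square error, .894, are the same as those in Example 7.12 for the single-response
case. Similarly, the second estimated regression function, $14.14 + 2.25z_1 + 5.67z_2$, is
the same as that given in Example 7.10.

We see that the data enable us to predict the first response, $Y_1$, with smaller
error than the second response, $Y_2$. The positive covariance .893 indicates that
overprediction (underprediction) of CPU time tends to be accompanied by overprediction
(underprediction) of disk capacity. ■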
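The coefficient matrix and intercepts of Example 7.13 can be recomputed from the blocks of the sample covariance matrix. The following sketch assumes NumPy and the rounded values printed in the example; `predict` is a hypothetical helper added for illustration.

```python
import numpy as np

mean_y = np.array([150.44, 327.79])               # (y1bar, y2bar)
mean_z = np.array([130.24, 3.547])                # (z1bar, z2bar)
S_yz = np.array([[418.763, 35.983],
                 [1008.976, 140.558]])
S_zz = np.array([[377.200, 28.034],
                 [28.034, 13.657]])

# B_hat = S_YZ S_ZZ^{-1}. S_ZZ is symmetric, so solving S_ZZ X = S_YZ' and
# transposing gives the same matrix without forming the inverse.
B_hat = np.linalg.solve(S_zz, S_yz.T).T
intercepts = mean_y - B_hat @ mean_z              # ybar - B_hat zbar

def predict(z):
    """Estimated regression function evaluated at predictor values z = (z1, z2)."""
    return intercepts + B_hat @ np.asarray(z, dtype=float)

print(np.round(B_hat, 3))
print(np.round(intercepts, 2))
```

Each row of `B_hat` is the coefficient vector of one univariate regression, which is the sense in which the multivariate regression is composed of m univariate regressions.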


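Comment. Result 7.14 states that the assumption of a joint normal distribution
for the whole collection $Y_1, Y_2, \ldots, Y_m, Z_1, Z_2, \ldots, Z_r$ leads to the prediction
equations

$$\begin{aligned} \hat{y}_1 &= \hat{\beta}_{01} + \hat{\beta}_{11}z_1 + \cdots + \hat{\beta}_{r1}z_r \\ \hat{y}_2 &= \hat{\beta}_{02} + \hat{\beta}_{12}z_1 + \cdots + \hat{\beta}_{r2}z_r \\ &\;\;\vdots \\ \hat{y}_m &= \hat{\beta}_{0m} + \hat{\beta}_{1m}z_1 + \cdots + \hat{\beta}_{rm}z_r \end{aligned}$$

We note the following:

1. The same values, $z_1, z_2, \ldots, z_r$, are used to predict each $Y_i$.
2. The $\hat{\beta}_{ik}$ are estimates of the $(i,k)$th entry of the regression coefficient matrix
$\boldsymbol{\beta} = \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}$ for $i, k \ge 1$.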


We conclude this discussion of the regression problem by introducing one further 
correlation coefficient. 


Partial Correlation Coefficient 

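Consider the pair of errors

$$Y_1 - \mu_{Y_1} - \boldsymbol{\Sigma}_{Y_1Z}\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{Z} - \boldsymbol{\mu}_Z)$$

$$Y_2 - \mu_{Y_2} - \boldsymbol{\Sigma}_{Y_2Z}\boldsymbol{\Sigma}_{ZZ}^{-1}(\mathbf{Z} - \boldsymbol{\mu}_Z)$$

obtained from using the best linear predictors to predict $Y_1$ and $Y_2$. Their correlation,
determined from the error covariance matrix $\boldsymbol{\Sigma}_{YY\cdot Z} = \boldsymbol{\Sigma}_{YY} - \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\Sigma}_{ZY}$,
measures the association between $Y_1$ and $Y_2$ after eliminating the effects of $Z_1$,
$Z_2, \ldots, Z_r$.

We define the partial correlation coefficient between $Y_1$ and $Y_2$, eliminating $Z_1$,
$Z_2, \ldots, Z_r$, by

$$\rho_{Y_1Y_2\cdot Z} = \frac{\sigma_{Y_1Y_2\cdot Z}}{\sqrt{\sigma_{Y_1Y_1\cdot Z}}\sqrt{\sigma_{Y_2Y_2\cdot Z}}} \tag{7-56}$$

where $\sigma_{Y_iY_k\cdot Z}$ is the $(i,k)$th entry in the matrix $\boldsymbol{\Sigma}_{YY\cdot Z} = \boldsymbol{\Sigma}_{YY} - \boldsymbol{\Sigma}_{YZ}\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\Sigma}_{ZY}$. The
corresponding sample partial correlation coefficient is

$$r_{Y_1Y_2\cdot Z} = \frac{s_{Y_1Y_2\cdot Z}}{\sqrt{s_{Y_1Y_1\cdot Z}}\sqrt{s_{Y_2Y_2\cdot Z}}} \tag{7-57}$$

with $s_{Y_iY_k\cdot Z}$ the $(i,k)$th element of $\mathbf{S}_{YY} - \mathbf{S}_{YZ}\mathbf{S}_{ZZ}^{-1}\mathbf{S}_{ZY}$. Assuming that $\mathbf{Y}$ and $\mathbf{Z}$ have
a joint multivariate normal distribution, we find that the sample partial correlation
coefficient in (7-57) is the maximum likelihood estimator of the partial correlation
coefficient in (7-56).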

Example 7.14 (Calculating a partial correlation) From the computer data in
Example 7.13,

$$\mathbf{S}_{YY} - \mathbf{S}_{YZ}\mathbf{S}_{ZZ}^{-1}\mathbf{S}_{ZY} = \begin{bmatrix} 1.043 & 1.042 \\ 1.042 & 2.572 \end{bmatrix}$$

Therefore,

$$r_{Y_1Y_2\cdot Z} = \frac{s_{Y_1Y_2\cdot Z}}{\sqrt{s_{Y_1Y_1\cdot Z}}\sqrt{s_{Y_2Y_2\cdot Z}}} = \frac{1.042}{\sqrt{1.043}\sqrt{2.572}} = .64 \tag{7-58}$$

Calculating the ordinary correlation coefficient, we obtain $r_{Y_1Y_2} = .96$. Comparing
the two correlation coefficients, we see that the association between $Y_1$ and $Y_2$
has been sharply reduced after eliminating the effects of the variables $\mathbf{Z}$ on both
responses. ■
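The two correlations compared in Example 7.14 can be reproduced from the matrices already displayed. This is a small sketch assuming NumPy; `correlation_from_cov` is a hypothetical helper, and the residual matrix is taken as printed rather than recomputed.

```python
import numpy as np

def correlation_from_cov(C, i=0, k=1):
    """Correlation implied by entries (i, k), (i, i) and (k, k) of a covariance-type matrix."""
    return C[i, k] / np.sqrt(C[i, i] * C[k, k])

# S_YY - S_YZ S_ZZ^{-1} S_ZY as printed in Example 7.14.
resid_cov = np.array([[1.043, 1.042],
                      [1.042, 2.572]])
# S_YY block of the sample covariance matrix in Example 7.13.
S_yy = np.array([[467.913, 1148.556],
                 [1148.556, 3072.491]])

partial_r = correlation_from_cov(resid_cov)       # sample partial correlation, (7-57)
ordinary_r = correlation_from_cov(S_yy)           # ordinary sample correlation

print(round(partial_r, 2), round(ordinary_r, 2))
```

The same function serves both cases because (7-57) has exactly the form of an ordinary correlation, applied to the residual matrix instead of to the covariance matrix of the responses.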


7.9 Comparing the Two Formulations of the Regression Model 


In Sections 7.2 and 7.7, we presented the multiple regression models for one 
and several response variables, respectively. In these treatments, the predictor 
variables had fixed values $\mathbf{z}_j$ at the $j$th trial. Alternatively, we can start (as
in Section 7.8) with a set of variables that have a joint normal distribution.
The process of conditioning on one subset of variables in order to predict values 
of the other set leads to a conditional expectation that is a multiple regression 
model. The two approaches to multiple regression are related. To show this 
relationship explicitly, we introduce two minor variants of the regression model 
formulation. 


Mean Corrected Form of the Regression Model 
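For any response variable $Y$, the multiple regression model asserts that

$$Y_j = \beta_0 + \beta_1 z_{j1} + \cdots + \beta_r z_{jr} + \varepsilon_j$$

The predictor variables can be "centered" by subtracting their means. For instance,
$\beta_1 z_{j1} = \beta_1(z_{j1} - \bar{z}_1) + \beta_1\bar{z}_1$ and we can write

$$\begin{aligned} Y_j &= (\beta_0 + \beta_1\bar{z}_1 + \cdots + \beta_r\bar{z}_r) + \beta_1(z_{j1} - \bar{z}_1) + \cdots + \beta_r(z_{jr} - \bar{z}_r) + \varepsilon_j \\ &= \beta_* + \beta_1(z_{j1} - \bar{z}_1) + \cdots + \beta_r(z_{jr} - \bar{z}_r) + \varepsilon_j \end{aligned} \tag{7-59}$$

with $\beta_* = \beta_0 + \beta_1\bar{z}_1 + \cdots + \beta_r\bar{z}_r$. The mean corrected design matrix corresponding
to the reparameterization in (7-59) is

$$\mathbf{Z}_c = \begin{bmatrix} 1 & z_{11} - \bar{z}_1 & \cdots & z_{1r} - \bar{z}_r \\ 1 & z_{21} - \bar{z}_1 & \cdots & z_{2r} - \bar{z}_r \\ \vdots & \vdots & \ddots & \vdots \\ 1 & z_{n1} - \bar{z}_1 & \cdots & z_{nr} - \bar{z}_r \end{bmatrix}$$

where the last $r$ columns are each perpendicular to the first column, since

$$\sum_{j=1}^{n} 1\,(z_{ji} - \bar{z}_i) = 0, \qquad i = 1, 2, \ldots, r$$

Further, setting $\mathbf{Z}_c = [\mathbf{1} \mid \mathbf{Z}_{c2}]$ with $\mathbf{Z}_{c2}'\mathbf{1} = \mathbf{0}$, we obtain

$$\mathbf{Z}_c'\mathbf{Z}_c = \begin{bmatrix} \mathbf{1}'\mathbf{1} & \mathbf{1}'\mathbf{Z}_{c2} \\ \mathbf{Z}_{c2}'\mathbf{1} & \mathbf{Z}_{c2}'\mathbf{Z}_{c2} \end{bmatrix} = \begin{bmatrix} n & \mathbf{0}' \\ \mathbf{0} & \mathbf{Z}_{c2}'\mathbf{Z}_{c2} \end{bmatrix}$$

so

$$\begin{bmatrix} \hat{\beta}_* \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_r \end{bmatrix} = (\mathbf{Z}_c'\mathbf{Z}_c)^{-1}\mathbf{Z}_c'\mathbf{y} = \begin{bmatrix} \dfrac{1}{n} & \mathbf{0}' \\ \mathbf{0} & (\mathbf{Z}_{c2}'\mathbf{Z}_{c2})^{-1} \end{bmatrix}\begin{bmatrix} \mathbf{1}'\mathbf{y} \\ \mathbf{Z}_{c2}'\mathbf{y} \end{bmatrix} = \begin{bmatrix} \bar{y} \\ (\mathbf{Z}_{c2}'\mathbf{Z}_{c2})^{-1}\mathbf{Z}_{c2}'\mathbf{y} \end{bmatrix}$$

That is, the regression coefficients $[\beta_1, \beta_2, \ldots, \beta_r]'$ are unbiasedly estimated by
$(\mathbf{Z}_{c2}'\mathbf{Z}_{c2})^{-1}\mathbf{Z}_{c2}'\mathbf{y}$ and $\beta_*$ is estimated by $\bar{y}$. Because the definitions of $\beta_1, \beta_2, \ldots, \beta_r$
remain unchanged by the reparameterization in (7-59), their best estimates computed
from the design matrix $\mathbf{Z}_c$ are exactly the same as the best estimates computed
from the design matrix $\mathbf{Z}$. Thus, setting $\hat{\boldsymbol{\beta}}_c' = [\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_r]$, the linear
predictor of $Y$ can be written as

$$\hat{y} = \hat{\beta}_* + \hat{\boldsymbol{\beta}}_c'(\mathbf{z} - \bar{\mathbf{z}}) = \bar{y} + \mathbf{y}'\mathbf{Z}_{c2}(\mathbf{Z}_{c2}'\mathbf{Z}_{c2})^{-1}(\mathbf{z} - \bar{\mathbf{z}}) \tag{7-61}$$

with $(\mathbf{z} - \bar{\mathbf{z}}) = [z_1 - \bar{z}_1, z_2 - \bar{z}_2, \ldots, z_r - \bar{z}_r]'$. Finally,

$$\begin{bmatrix} \operatorname{Var}(\hat{\beta}_*) & \operatorname{Cov}(\hat{\beta}_*, \hat{\boldsymbol{\beta}}_c) \\ \operatorname{Cov}(\hat{\boldsymbol{\beta}}_c, \hat{\beta}_*) & \operatorname{Cov}(\hat{\boldsymbol{\beta}}_c) \end{bmatrix} = (\mathbf{Z}_c'\mathbf{Z}_c)^{-1}\sigma^2 = \begin{bmatrix} \dfrac{\sigma^2}{n} & \mathbf{0}' \\ \mathbf{0} & (\mathbf{Z}_{c2}'\mathbf{Z}_{c2})^{-1}\sigma^2 \end{bmatrix} \tag{7-62}$$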


Comment. The multivariate multiple regression model yields the same mean
corrected design matrix for each response. The least squares estimates of the coefficient
vectors for the $i$th response are given by

$$\hat{\boldsymbol{\beta}}_{(i)} = \begin{bmatrix} \bar{y}_{(i)} \\ (\mathbf{Z}_{c2}'\mathbf{Z}_{c2})^{-1}\mathbf{Z}_{c2}'\mathbf{y}_{(i)} \end{bmatrix}, \qquad i = 1, 2, \ldots, m \tag{7-63}$$

Sometimes, for even further numerical stability, "standardized" input variables

$$(z_{ji} - \bar{z}_i)\Big/\sqrt{\sum_{j=1}^{n}(z_{ji} - \bar{z}_i)^2} = (z_{ji} - \bar{z}_i)\big/\sqrt{(n-1)s_{z_iz_i}}$$

are used. In this case, the slope coefficients $\beta_i$ in the regression model are replaced by
$\tilde{\beta}_i = \beta_i\sqrt{(n-1)s_{z_iz_i}}$. The least squares estimates of the beta coefficients $\tilde{\beta}_i$ become
$\hat{\tilde{\beta}}_i = \hat{\beta}_i\sqrt{(n-1)s_{z_iz_i}}$, $i = 1, 2, \ldots, r$. These relationships hold for each response in the
multivariate multiple regression situation as well.


Relating the Formulations 


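When the variables $Y, Z_1, Z_2, \ldots, Z_r$ are jointly normal, the estimated predictor of $Y$
(see Result 7.13) is

$$\hat{\beta}_0 + \hat{\boldsymbol{\beta}}'\mathbf{z} = \bar{y} + \mathbf{s}_{ZY}'\mathbf{S}_{ZZ}^{-1}(\mathbf{z} - \bar{\mathbf{z}}) = \hat{\mu}_Y + \hat{\boldsymbol{\sigma}}_{ZY}'\hat{\boldsymbol{\Sigma}}_{ZZ}^{-1}(\mathbf{z} - \hat{\boldsymbol{\mu}}_Z) \tag{7-64}$$

where the estimation procedure leads naturally to the introduction of centered $z_i$'s.
Recall from the mean corrected form of the regression model that the best linear
predictor of $Y$ [see (7-61)] is

$$\hat{y} = \hat{\beta}_* + \hat{\boldsymbol{\beta}}_c'(\mathbf{z} - \bar{\mathbf{z}})$$

with $\hat{\beta}_* = \bar{y}$ and $\hat{\boldsymbol{\beta}}_c' = \mathbf{y}'\mathbf{Z}_{c2}(\mathbf{Z}_{c2}'\mathbf{Z}_{c2})^{-1}$. Comparing (7-61) and (7-64), we see that
$\hat{\beta}_* = \bar{y} = \hat{\beta}_0$ and $\hat{\boldsymbol{\beta}}_c = \hat{\boldsymbol{\beta}}$ since$^{7}$

$$\mathbf{s}_{ZY}'\mathbf{S}_{ZZ}^{-1} = \mathbf{y}'\mathbf{Z}_{c2}(\mathbf{Z}_{c2}'\mathbf{Z}_{c2})^{-1} \tag{7-65}$$

Therefore, both the normal theory conditional mean and the classical regression
model approaches yield exactly the same linear predictors.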


A similar argument indicates that the best linear predictors of the responses in 
the two multivariate multiple regression setups are also exactly the same. 
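The agreement between the two formulations can be seen numerically. The sketch below assumes NumPy and uses simulated data; the sample size, true coefficients and random seed are hypothetical values chosen only for illustration. It fits the classical model with the design matrix Z, fits the mean corrected model with Z_c, and computes the normal-theory slopes from the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 2
Z_raw = rng.normal(size=(n, r))                   # hypothetical predictor values
y = 1.0 + Z_raw @ np.array([2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Classical form: design matrix Z = [1 | Z_raw].
Z = np.column_stack([np.ones(n), Z_raw])
beta_classical = np.linalg.lstsq(Z, y, rcond=None)[0]

# Mean corrected form: Z_c = [1 | Z_c2] with centered predictor columns.
Z_c2 = Z_raw - Z_raw.mean(axis=0)
Z_c = np.column_stack([np.ones(n), Z_c2])
beta_centered = np.linalg.lstsq(Z_c, y, rcond=None)[0]

# Normal-theory form: S_ZZ^{-1} s_ZY from the sample covariance matrix (divisor n - 1).
S = np.cov(np.column_stack([y, Z_raw]), rowvar=False)
beta_normal = np.linalg.solve(S[1:, 1:], S[1:, 0])

print(beta_classical[1:], beta_centered[1:], beta_normal)
print(beta_centered[0], y.mean())
```

The three slope vectors coincide, and the intercept of the mean corrected fit equals the sample mean of the response, which is the content of (7-61), (7-64) and (7-65).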


Example 7.15 (Two approaches yield the same linear predictor) The computer data with 
the single response Y, = CPU time were analyzed in Example 7.6 using the classical lin- 
ear regression model. The same data were analyzed again in Example 7.12, assuming 
that the variables $Y_1$, $Z_1$, and $Z_2$ were jointly normal so that the best predictor of $Y_1$ is
the conditional mean of $Y_1$ given $z_1$ and $z_2$. Both approaches yielded the same predictor,

$$\hat{y} = 8.42 + 1.08z_1 + .42z_2$$ ■


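$^{7}$ The identity in (7-65) is established by writing $\mathbf{y} = (\mathbf{y} - \bar{y}\mathbf{1}) + \bar{y}\mathbf{1}$ so that

$$\mathbf{y}'\mathbf{Z}_{c2} = (\mathbf{y} - \bar{y}\mathbf{1})'\mathbf{Z}_{c2} + \bar{y}\mathbf{1}'\mathbf{Z}_{c2} = (\mathbf{y} - \bar{y}\mathbf{1})'\mathbf{Z}_{c2} + \mathbf{0}' = (\mathbf{y} - \bar{y}\mathbf{1})'\mathbf{Z}_{c2}$$

Consequently,

$$\mathbf{y}'\mathbf{Z}_{c2}(\mathbf{Z}_{c2}'\mathbf{Z}_{c2})^{-1} = (\mathbf{y} - \bar{y}\mathbf{1})'\mathbf{Z}_{c2}(\mathbf{Z}_{c2}'\mathbf{Z}_{c2})^{-1} = (n-1)\mathbf{s}_{ZY}'[(n-1)\mathbf{S}_{ZZ}]^{-1} = \mathbf{s}_{ZY}'\mathbf{S}_{ZZ}^{-1}$$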


Although the two formulations of the linear prediction problem yield the same 
predictor equations, conceptually they are quite different. For the model in (7-3) or 
(7-23), the values of the input variables are assumed to be set by the experimenter. 
In the conditional mean model of (7-51) or (7-53), the values of the predictor vari- 
ables are random variables that are observed along with the values of the response 
variable(s). The assumptions underlying the second approach are more stringent, 
but they yield an optimal predictor among all choices, rather than merely among 
linear predictors. 

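We close by noting that the multivariate regression calculations in either case
can be couched in terms of the sample mean vectors $\bar{\mathbf{y}}$ and $\bar{\mathbf{z}}$ and the sample sums of
squares and cross-products:

$$\begin{bmatrix} \sum_{j=1}^{n}(\mathbf{y}_j - \bar{\mathbf{y}})(\mathbf{y}_j - \bar{\mathbf{y}})' & \sum_{j=1}^{n}(\mathbf{y}_j - \bar{\mathbf{y}})(\mathbf{z}_j - \bar{\mathbf{z}})' \\ \sum_{j=1}^{n}(\mathbf{z}_j - \bar{\mathbf{z}})(\mathbf{y}_j - \bar{\mathbf{y}})' & \sum_{j=1}^{n}(\mathbf{z}_j - \bar{\mathbf{z}})(\mathbf{z}_j - \bar{\mathbf{z}})' \end{bmatrix} = \begin{bmatrix} \mathbf{Y}_c'\mathbf{Y}_c & \mathbf{Y}_c'\mathbf{Z}_{c2} \\ \mathbf{Z}_{c2}'\mathbf{Y}_c & \mathbf{Z}_{c2}'\mathbf{Z}_{c2} \end{bmatrix} = n\begin{bmatrix} \hat{\boldsymbol{\Sigma}}_{YY} & \hat{\boldsymbol{\Sigma}}_{YZ} \\ \hat{\boldsymbol{\Sigma}}_{ZY} & \hat{\boldsymbol{\Sigma}}_{ZZ} \end{bmatrix}$$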

This is the only information necessary to compute the estimated regression coeffi- 
cients and their estimated covariances. Of course, an important part of regression 
analysis is model checking. This requires the residuals (errors), which must be calcu- 
lated using all the original data. 


7.10 Multiple Regression Models with Time Dependent Errors 


For data collected over time, observations in different time periods are often relat- 
ed, or autocorrelated. Consequently, in a regression context, the observations on the 
dependent variable or, equivalently, the errors, cannot be independent. As indicated 
in our discussion of dependence in Section 5.8, time dependence in the observations 
can invalidate inferences made using the usual independence assumption. Similarly, 
inferences in regression can be misleading when regression models are fit to time 
ordered data and the standard regression assumptions are used. This issue is impor- 
tant so, in the example that follows, we not only show how to detect the presence of 
time dependence, but also how to incorporate this dependence into the multiple re- 
gression model. 


Example 7.16 (Incorporating time dependent errors in a regression model) Power 
companies must have enough natural gas to heat all of their customers’ homes and 
businesses, particularly during the coldest days of the year. A major component of 
the planning process is a forecasting exercise based on a model relating the send- 
outs of natural gas to factors, like temperature, that clearly have some relationship 
to the amount of gas consumed. More gas is required on cold days. Rather than 
use the daily average temperature, it is customary to use degree heating days
(DHD) = 65 deg - daily average temperature. A large number for DHD indicates
a cold day. Wind speed, again a 24-hour average, can also be a factor
in the sendout amount. Because many businesses close for the weekend, the 
demand for natural gas is typically less on a weekend day. Data on these variables 
for one winter in a major northern city are shown, in part, in Table 7.4. (See 
website: www.prenhall.com/statistics for the complete data set. There are n = 63 
observations.) 


Table 7.4 Natural Gas Data

   Y        Z1      Z2         Z3          Z4
Sendout    DHD    DHDLag    Windspeed    Weekend
  227       32      30         12           1
  236       31      32          8           1
  228       30      31          8           0
  252       34      30          8           0
  238       28      34         12           0
  333       46      41          8           0
  266       33      46          8           0
  280       38      33         18           0
  386       52      38         22           0
  415       57      52         18           0

Initially, we developed a regression model relating gas sendout to degree 
heating days, wind speed and a weekend dummy variable. Other variables likely 
to have some effect on natural gas consumption, like percent cloud cover, are
subsumed in the error term. After several attempted fits, we decided to include 
not only the current DHD but also that of the previous day. (The degree heating 
day lagged one time period is denoted by DHDLag in Table 7.4.) The fitted 
model is 


Sendout = 1.858 + 5.874 DHD + 1.405 DHDLag
          + 1.315 Windspeed - 15.857 Weekend


with $R^2 = .952$. All the coefficients, with the exception of the intercept, are significant
and it looks like we have a very good fit. (The intercept term could be dropped.
When this is done, the results do not change substantially.) However, if we calculate 
the correlation of the residuals that are adjacent in time, the lag 1 autocorrelation, 
we get 


$$\text{lag 1 autocorrelation} = r_1(\hat{\varepsilon}) = \frac{\sum_{j=2}^{n}\hat{\varepsilon}_j\hat{\varepsilon}_{j-1}}{\sum_{j=1}^{n}\hat{\varepsilon}_j^2} = .52$$

The value, .52, of the lag 1 autocorrelation is too large to be ignored. A plot of
the residual autocorrelations for the first 15 lags shows that there might also be
some dependence among the errors 7 time periods, or one week, apart. This amount
of dependence invalidates the $t$-tests and $P$-values associated with the coefficients in
the model.
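The lag 1 autocorrelation above is a ratio of two sums over the residuals and is simple to compute. The following sketch assumes NumPy; the function name `lag_autocorrelation` and the toy residuals are hypothetical. It relies on the residuals having mean zero, as least squares residuals do when the model contains an intercept, so no centering is applied.

```python
import numpy as np

def lag_autocorrelation(residuals, lag=1):
    """Sum of products of residuals `lag` periods apart, divided by the residual sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(e[lag:] * e[:-lag]) / np.sum(e ** 2)

# Toy residuals that alternate in sign, so adjacent values are negatively related.
toy = [1.0, -1.0, 1.0, -1.0]
r1 = lag_autocorrelation(toy, lag=1)
r2 = lag_autocorrelation(toy, lag=2)
print(r1, r2)
```

Applied to the residuals of the fitted sendout model with `lag=1`, the same function would return the statistic reported in the example; the complete data set is needed for that, and it is not reproduced here.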

The first step toward correcting the model is to replace the presumed independent
errors in the regression model for sendout with a possibly dependent series of
noise terms $N_j$. That is, we formulate a regression model for the $N_j$ where we relate
each $N_j$ to its previous value $N_{j-1}$, its value one week ago, $N_{j-7}$, and an independent
error $\varepsilon_j$. Thus, we consider

$$N_j = \phi_1 N_{j-1} + \phi_7 N_{j-7} + \varepsilon_j$$

where the $\varepsilon_j$ are independent normal random variables with mean 0 and variance
$\sigma^2$. The form of the equation for $N_j$ is known as an autoregressive model. (See [8].)
The SAS commands and part of the output from fitting this combined regression
model for sendout with an autoregressive model for the noise are shown in Panel 7.3.
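To make the autoregressive noise model concrete, the sketch below simulates a long series from it and recovers the two coefficients. It assumes NumPy; the coefficient values .47 and .24, the noise standard deviation, the series length, the burn-in and the seed are illustrative choices made here. The estimation step is ordinary least squares on the lagged values (conditional least squares), which is a simpler stand-in for the maximum likelihood fit that the SAS procedure performs jointly with the regression.

```python
import numpy as np

def simulate_ar_noise(n, phi1, phi7, sigma, rng, burn_in=200):
    """Simulate N_j = phi1 * N_{j-1} + phi7 * N_{j-7} + eps_j, discarding a burn-in stretch."""
    total = n + burn_in
    eps = rng.normal(scale=sigma, size=total)
    noise = np.zeros(total)
    for j in range(total):
        prev1 = noise[j - 1] if j >= 1 else 0.0
        prev7 = noise[j - 7] if j >= 7 else 0.0
        noise[j] = phi1 * prev1 + phi7 * prev7 + eps[j]
    return noise[burn_in:]

def fit_ar_1_7(noise):
    """Least squares estimates of (phi1, phi7) from regressing N_j on N_{j-1} and N_{j-7}."""
    N = np.asarray(noise, dtype=float)
    X = np.column_stack([N[6:-1], N[:-7]])        # columns: N_{j-1}, N_{j-7} for j = 7, ..., len-1
    target = N[7:]
    return np.linalg.lstsq(X, target, rcond=None)[0]

rng = np.random.default_rng(1)
noise = simulate_ar_noise(20000, 0.47, 0.24, 15.0, rng)
phi_hat = fit_ar_1_7(noise)
print(np.round(phi_hat, 3))
```

With a series this long the estimates land close to the values used in the simulation; with only 63 observations, as in the gas data, the sampling variability is much larger.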

The fitted model is 


Sendout = 2.130 + 5.810 DHD + 1.426 DHDLag
          + 1.207 Windspeed - 10.109 Weekend


and the time dependence in the noise terms is estimated by

$$N_j = .470N_{j-1} + .240N_{j-7} + \varepsilon_j$$

The variance of $\varepsilon$ is estimated to be $\hat{\sigma}^2 = 228.89$.

From Panel 7.3, we see that the autocorrelations of the residuals from the en- 
riched model are all negligible. Each is within two estimated standard errors of 0. 
Also, a weighted sum of squares of residual autocorrelations for a group of consec- 
utive lags is not large as judged by the P-value for this statistic. That is, there is no 
reason to reject the hypothesis that a group of consecutive autocorrelations are si- 
multaneously equal to 0. The groups examined in Panel 7.3 are those for lags 1-6, 
1-12, 1-18, and 1-24. 

The noise is now adequately modeled. The tests concerning the coefficient of 
each predictor variable, the significance of the regression, and so forth, are now 
valid.$^{8}$ The intercept term in the final model can be dropped. When this is done,
there is very little change in the resulting model. The enriched model has better
forecasting potential and can now be used to forecast sendout of natural gas for
given values of the predictor variables. We will not pursue prediction here, since it
involves ideas beyond the scope of this book. (See [8].) ■


$^{8}$ These tests are obtained by the extra sum of squares procedure but applied to the regression plus
autoregressive noise model. The tests are those described in the computer output.


When modeling relationships using time ordered data, regression models with 
noise structures that allow for the time dependence are often useful. Modern software
packages, like SAS, allow the analyst to easily fit these expanded models.


PANEL 7.3 SAS ANALYSIS FOR EXAMPLE 7.16 USING PROC ARIMA 


PROGRAM COMMANDS

data a;
infile 'T7-4.dat';
time = _n_;
input obsend dhd dhdlag wind xweekend;
proc arima data = a;
identify var = obsend crosscor = (dhd dhdlag wind xweekend);
estimate p = (1 7) method = ml input = (dhd dhdlag wind xweekend) plot;
estimate p = (1 7) noconstant method = ml input = (dhd dhdlag wind xweekend) plot;


—__—__—_ 
OUTPUT

ARIMA Procedure

Maximum Likelihood Estimation

                           Approx.
Parameter     Estimate    Std Error    T Ratio    Lag    Variable    Shift
MU             2.12957    13.12340       0.16      0     OBSEND        0
AR1,1          0.47008     0.11779       3.99      1     OBSEND        0
AR1,2          0.23986     0.11528       2.08      7     OBSEND        0
NUM1           5.80976     0.24047      24.16      0     DHD           0
NUM2           1.42632     0.24932       5.72      0     DHDLAG        0
NUM3           1.20740     0.44681       2.70      0     WIND          0
NUM4         -10.10890     6.03445      -1.68      0     XWEEKEND      0

Constant Estimate   = 0.61770069
Variance Estimate   = 228.894028
Std Error Estimate  = 15.1292441
AIC                 = 528.490321
SBC                 = 543.492264
Number of Residuals = 63

Autocorrelation Check of Residuals

To     Chi
Lag    Square    DF    Prob     Autocorrelations
 6      6.04      4    0.196    0.079    0.012    0.022    0.192   -0.127    0.161
12     10.27     10    0.       0.144   -0.067   -0.111   -0.056   -0.056   -0.108
18     15.92     16    0.       0.013    0.106   -0.137    0.170   -0.079    0.018
24     23.44     22    0.       0.018    0.004    0.250    0.080    0.069   -0.051


Autocorrelation Plot of Residuals

Lag    Covariance     Correlation
 0     228.894          1.00000
 1      18.194945       0.07949
 2       2.763255       0.01207
 3       5.038727       0.02201
 4      44.059835       0.19249
 5     -29.118892      -0.12722
 6      36.904291       0.16123
 7      33.008858       0.14421
 8     -15.424015      -0.06738
 9     -25.379057      -0.11088
10     -12.890888      -0.05632
11     -12.777280      -0.05582
12     -24.825623      -0.10846
13       2.970197       0.01298
14      24.150168       0.10551
15     -31.407314      -0.13721

[Bar plot of the residual autocorrelations on a scale from -1 to 1; "." marks two standard errors]


Supplement 


THE DISTRIBUTION OF THE LIKELIHOOD 
RATIO FOR THE MULTIVARIATE 
MULTIPLE REGRESSION MODEL 


The development in this supplement establishes Result 7.11. 

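We know that $n\hat{\boldsymbol{\Sigma}} = \mathbf{Y}'(\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}')\mathbf{Y}$ and under $H_0$, $n\hat{\boldsymbol{\Sigma}}_1 = \mathbf{Y}'[\mathbf{I} - \mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1']\mathbf{Y}$
with $\mathbf{Y} = \mathbf{Z}_1\boldsymbol{\beta}_{(1)} + \boldsymbol{\varepsilon}$. Set $\mathbf{P} = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']$.
Since $\mathbf{0} = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{Z} = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'][\mathbf{Z}_1 \mid \mathbf{Z}_2] = [\mathbf{P}\mathbf{Z}_1 \mid \mathbf{P}\mathbf{Z}_2]$, the
columns of $\mathbf{Z}$ are perpendicular to $\mathbf{P}$. Thus, we can write

$$n\hat{\boldsymbol{\Sigma}} = (\mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon})'\mathbf{P}(\mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = \boldsymbol{\varepsilon}'\mathbf{P}\boldsymbol{\varepsilon}$$

$$n\hat{\boldsymbol{\Sigma}}_1 = (\mathbf{Z}_1\boldsymbol{\beta}_{(1)} + \boldsymbol{\varepsilon})'\mathbf{P}_1(\mathbf{Z}_1\boldsymbol{\beta}_{(1)} + \boldsymbol{\varepsilon}) = \boldsymbol{\varepsilon}'\mathbf{P}_1\boldsymbol{\varepsilon}$$

where $\mathbf{P}_1 = \mathbf{I} - \mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'$. We then use the Gram-Schmidt process (see Result 2A.3)
to construct the orthonormal vectors $[\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_{q+1}] = \mathbf{G}$ from the
columns of $\mathbf{Z}_1$. Then we continue, obtaining the orthonormal set from $[\mathbf{G}, \mathbf{Z}_2]$, and
finally complete the set to $n$ dimensions by constructing an arbitrary orthonormal
set of $n - r - 1$ vectors orthogonal to the previous vectors. Consequently, we have

$$\underbrace{\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_{q+1}}_{\text{from columns of } \mathbf{Z}_1},\quad \underbrace{\mathbf{g}_{q+2}, \mathbf{g}_{q+3}, \ldots, \mathbf{g}_{r+1}}_{\substack{\text{from columns of } \mathbf{Z}_2 \\ \text{but perpendicular to columns of } \mathbf{Z}_1}},\quad \underbrace{\mathbf{g}_{r+2}, \mathbf{g}_{r+3}, \ldots, \mathbf{g}_n}_{\substack{\text{arbitrary set of orthonormal vectors} \\ \text{orthogonal to columns of } \mathbf{Z}}}$$

Let $(\lambda, \mathbf{e})$ be an eigenvalue-eigenvector pair of $\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'$. Then, since
$[\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'][\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'] = \mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'$, it follows that

$$\lambda\mathbf{e} = \mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{e} = (\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1')^2\mathbf{e} = \lambda(\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1')\mathbf{e} = \lambda^2\mathbf{e}$$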


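and the eigenvalues of $\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'$ are 0 or 1. Moreover,
$\operatorname{tr}(\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1') = \operatorname{tr}((\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{Z}_1) = \operatorname{tr}(\mathbf{I}_{(q+1)\times(q+1)}) = q + 1 = \lambda_1 + \lambda_2 + \cdots + \lambda_{q+1}$, where
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{q+1} > 0$ are the eigenvalues of $\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'$. This shows that
$\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'$ has $q + 1$ eigenvalues equal to 1. Now, $(\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1')\mathbf{Z}_1 = \mathbf{Z}_1$, so
any linear combination $\mathbf{Z}_1\mathbf{b}_\ell$ of unit length is an eigenvector corresponding to the
eigenvalue 1. The orthonormal vectors $\mathbf{g}_\ell$, $\ell = 1, 2, \ldots, q + 1$, are therefore eigenvectors
of $\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'$, since they are formed by taking particular linear combinations
of the columns of $\mathbf{Z}_1$. By the spectral decomposition (2-16), we have
$\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1' = \sum_{\ell=1}^{q+1}\mathbf{g}_\ell\mathbf{g}_\ell'$. Similarly, by writing $(\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}')\mathbf{Z} = \mathbf{Z}$, we readily see
that the linear combination $\mathbf{Z}\mathbf{b}_\ell = \mathbf{g}_\ell$, for example, is an eigenvector of $\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$
with eigenvalue $\lambda = 1$, so that $\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}' = \sum_{\ell=1}^{r+1}\mathbf{g}_\ell\mathbf{g}_\ell'$.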
f=1 


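Continuing, we have $\mathbf{P}\mathbf{Z} = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{Z} = \mathbf{Z} - \mathbf{Z} = \mathbf{0}$ so $\mathbf{g}_\ell = \mathbf{Z}\mathbf{b}_\ell$,
$\ell \le r + 1$, are eigenvectors of $\mathbf{P}$ with eigenvalues $\lambda = 0$. Also, from the way the $\mathbf{g}_\ell$,
$\ell > r + 1$, were constructed, $\mathbf{Z}'\mathbf{g}_\ell = \mathbf{0}$, so that $\mathbf{P}\mathbf{g}_\ell = \mathbf{g}_\ell$. Consequently, these $\mathbf{g}_\ell$'s
are eigenvectors of $\mathbf{P}$ corresponding to the $n - r - 1$ unit eigenvalues. By the spectral
decomposition (2-16), $\mathbf{P} = \sum_{\ell=r+2}^{n}\mathbf{g}_\ell\mathbf{g}_\ell'$ and

$$n\hat{\boldsymbol{\Sigma}} = \boldsymbol{\varepsilon}'\mathbf{P}\boldsymbol{\varepsilon} = \sum_{\ell=r+2}^{n}(\boldsymbol{\varepsilon}'\mathbf{g}_\ell)(\boldsymbol{\varepsilon}'\mathbf{g}_\ell)' = \sum_{\ell=r+2}^{n}\mathbf{V}_\ell\mathbf{V}_\ell'$$

where, because $\operatorname{Cov}(V_{\ell i}, V_{jk}) = E(\mathbf{g}_\ell'\boldsymbol{\varepsilon}_{(i)}\boldsymbol{\varepsilon}_{(k)}'\mathbf{g}_j) = \sigma_{ik}\mathbf{g}_\ell'\mathbf{g}_j = 0$, $\ell \ne j$, the $\boldsymbol{\varepsilon}'\mathbf{g}_\ell = \mathbf{V}_\ell = [V_{\ell 1}, \ldots, V_{\ell i}, \ldots, V_{\ell m}]'$
are independently distributed as $N_m(\mathbf{0}, \boldsymbol{\Sigma})$. Consequently,
by (4-22), $n\hat{\boldsymbol{\Sigma}}$ is distributed as $W_{p,\,n-r-1}(\boldsymbol{\Sigma})$. In the same manner,

$$\mathbf{P}_1\mathbf{g}_\ell = \begin{cases} \mathbf{g}_\ell & \ell > q + 1 \\ \mathbf{0} & \ell \le q + 1 \end{cases}$$

so $\mathbf{P}_1 = \sum_{\ell=q+2}^{n}\mathbf{g}_\ell\mathbf{g}_\ell'$. We can write the extra sum of squares and cross products as

$$n(\hat{\boldsymbol{\Sigma}}_1 - \hat{\boldsymbol{\Sigma}}) = \boldsymbol{\varepsilon}'(\mathbf{P}_1 - \mathbf{P})\boldsymbol{\varepsilon} = \sum_{\ell=q+2}^{r+1}(\boldsymbol{\varepsilon}'\mathbf{g}_\ell)(\boldsymbol{\varepsilon}'\mathbf{g}_\ell)' = \sum_{\ell=q+2}^{r+1}\mathbf{V}_\ell\mathbf{V}_\ell'$$

where the $\mathbf{V}_\ell$ are independently distributed as $N_m(\mathbf{0}, \boldsymbol{\Sigma})$. By (4-22), $n(\hat{\boldsymbol{\Sigma}}_1 - \hat{\boldsymbol{\Sigma}})$ is
distributed as $W_{p,\,r-q}(\boldsymbol{\Sigma})$ independently of $n\hat{\boldsymbol{\Sigma}}$, since $n(\hat{\boldsymbol{\Sigma}}_1 - \hat{\boldsymbol{\Sigma}})$ involves a different
set of independent $\mathbf{V}_\ell$'s.

The large sample distribution for $-\left[n - r - 1 - \tfrac{1}{2}(m - r + q + 1)\right]\ln\left(|\hat{\boldsymbol{\Sigma}}|/|\hat{\boldsymbol{\Sigma}}_1|\right)$
follows from Result 5.2, with $\nu - \nu_0 = m(m+1)/2 + m(r+1) - m(m+1)/2 - m(q+1) = m(r - q)$ d.f.
The use of $\left(n - r - 1 - \tfrac{1}{2}(m - r + q + 1)\right)$ instead
of $n$ in the statistic is due to Bartlett [4] following Box [7], and it improves the
chi-square approximation.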
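The eigenvalue facts used in this supplement can be checked on a small numerical example. The sketch below assumes NumPy; the dimensions n = 12, r = 3, q = 1, the random design matrix and the helper `residual_projector` are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, q = 12, 3, 1
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])   # n x (r + 1) design matrix
Z1 = Z[:, :q + 1]                                            # first q + 1 columns

def residual_projector(X):
    """I - X (X'X)^{-1} X', the projector onto the orthogonal complement of the columns of X."""
    return np.eye(X.shape[0]) - X @ np.linalg.solve(X.T @ X, X.T)

P = residual_projector(Z)
P1 = residual_projector(Z1)

eig_P = np.linalg.eigvalsh(P)            # eigenvalues of P: zeros and ones
eig_diff = np.linalg.eigvalsh(P1 - P)    # eigenvalues of P1 - P: zeros and ones

print(int(np.sum(eig_P > 0.5)), int(np.sum(eig_diff > 0.5)))
```

The counts of unit eigenvalues, n - r - 1 for P and r - q for P1 - P, are the degrees of freedom of the two Wishart distributions in Result 7.11, and the product P(P1 - P) = 0 reflects that the two quadratic forms are built from different vectors in the orthonormal set.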


420 Chapter 7 Multivariate Linear Regression Models 


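Exercises


7.1. Given the data

z | 0 5 79 1 8 i.
y 5 9 3 23 7 23

fit the linear regression model $Y_j = \beta_0 + \beta_1 z_{j1} + \varepsilon_j$, $j = 1, 2, \ldots, 6$. Specifically,
calculate the least squares estimates $\hat{\boldsymbol{\beta}}$, the fitted values $\hat{\mathbf{y}}$, the residuals $\hat{\boldsymbol{\varepsilon}}$, and the
residual sum of squares, $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$.

7.2. Given the data

fit the regression model

$$Y_j = \beta_1 z_{j1} + \beta_2 z_{j2} + \varepsilon_j, \qquad j = 1, 2, \ldots, 6$$

to the standardized form (see page 412) of the variables $y$, $z_1$, and $z_2$. From this fit, deduce
the corresponding fitted regression equation for the original (not standardized) variables.

7.3. (Weighted least squares estimators.) Let

$$\underset{(n\times 1)}{\mathbf{Y}} = \underset{(n\times(r+1))}{\mathbf{Z}}\;\underset{((r+1)\times 1)}{\boldsymbol{\beta}} + \underset{(n\times 1)}{\boldsymbol{\varepsilon}}$$

where $E(\boldsymbol{\varepsilon}) = \mathbf{0}$ but $E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}') = \sigma^2\mathbf{V}$, with $\mathbf{V}$ $(n \times n)$ known and positive definite. For
$\mathbf{V}$ of full rank, show that the weighted least squares estimator is

$$\hat{\boldsymbol{\beta}}_W = (\mathbf{Z}'\mathbf{V}^{-1}\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{Y}$$

If $\sigma^2$ is unknown, it may be estimated, unbiasedly, by

$$(n - r - 1)^{-1} \times (\mathbf{Y} - \mathbf{Z}\hat{\boldsymbol{\beta}}_W)'\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{Z}\hat{\boldsymbol{\beta}}_W).$$

Hint: $\mathbf{V}^{-1/2}\mathbf{Y} = (\mathbf{V}^{-1/2}\mathbf{Z})\boldsymbol{\beta} + \mathbf{V}^{-1/2}\boldsymbol{\varepsilon}$ is of the classical linear regression form $\mathbf{Y}^* = \mathbf{Z}^*\boldsymbol{\beta} + \boldsymbol{\varepsilon}^*$,
with $E(\boldsymbol{\varepsilon}^*) = \mathbf{0}$ and $E(\boldsymbol{\varepsilon}^*\boldsymbol{\varepsilon}^{*\prime}) = \sigma^2\mathbf{I}$. Thus $\hat{\boldsymbol{\beta}}_W = \hat{\boldsymbol{\beta}}^* = (\mathbf{Z}^{*\prime}\mathbf{Z}^*)^{-1}\mathbf{Z}^{*\prime}\mathbf{Y}^*$.

7.4. Use the weighted least squares estimator in Exercise 7.3 to derive an expression for
the estimate of the slope $\beta$ in the model $Y_j = \beta z_j + \varepsilon_j$, $j = 1, 2, \ldots, n$, when
(a) $\operatorname{Var}(\varepsilon_j) = \sigma^2$, (b) $\operatorname{Var}(\varepsilon_j) = \sigma^2 z_j$, and (c) $\operatorname{Var}(\varepsilon_j) = \sigma^2 z_j^2$. Comment on the manner
in which the unequal variances for the errors influence the optimal choice of $\hat{\beta}_W$.

7.5. Establish (7-50): $\rho_{Y(Z)}^2 = 1 - 1/\rho^{YY}$.
Hint: From (7-49) and Exercise 4.11

$$1 - \rho_{Y(Z)}^2 = \frac{\sigma_{YY} - \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY}}{\sigma_{YY}} = \frac{|\boldsymbol{\Sigma}_{ZZ}|\,(\sigma_{YY} - \boldsymbol{\sigma}_{ZY}'\boldsymbol{\Sigma}_{ZZ}^{-1}\boldsymbol{\sigma}_{ZY})}{|\boldsymbol{\Sigma}_{ZZ}|\,\sigma_{YY}} = \frac{|\boldsymbol{\Sigma}|}{|\boldsymbol{\Sigma}_{ZZ}|\,\sigma_{YY}}$$

From Result 2A.8(c), $\sigma^{YY} = |\boldsymbol{\Sigma}_{ZZ}|/|\boldsymbol{\Sigma}|$, where $\sigma^{YY}$ is the entry of $\boldsymbol{\Sigma}^{-1}$ in the first row and
first column. Since (see Exercise 2.23) $\boldsymbol{\rho} = \mathbf{V}^{-1/2}\boldsymbol{\Sigma}\mathbf{V}^{-1/2}$ and $\boldsymbol{\rho}^{-1} = (\mathbf{V}^{-1/2}\boldsymbol{\Sigma}\mathbf{V}^{-1/2})^{-1} = \mathbf{V}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{V}^{1/2}$,
the entry in the (1,1) position of $\boldsymbol{\rho}^{-1}$ is $\rho^{YY} = \sigma^{YY}\sigma_{YY}$.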


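7.6. (Generalized inverse of $\mathbf{Z}'\mathbf{Z}$) A matrix $(\mathbf{Z}'\mathbf{Z})^-$ is called a generalized inverse of $\mathbf{Z}'\mathbf{Z}$ if
$\mathbf{Z}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^-\mathbf{Z}'\mathbf{Z} = \mathbf{Z}'\mathbf{Z}$. Let $r_1 + 1 = \operatorname{rank}(\mathbf{Z})$ and suppose $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{r_1+1} > 0$
are the nonzero eigenvalues of $\mathbf{Z}'\mathbf{Z}$ with corresponding eigenvectors $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_{r_1+1}$.

(a) Show that

$$(\mathbf{Z}'\mathbf{Z})^- = \sum_{i=1}^{r_1+1} \lambda_i^{-1}\mathbf{e}_i\mathbf{e}_i'$$

is a generalized inverse of $\mathbf{Z}'\mathbf{Z}$.

(b) The coefficients $\hat{\boldsymbol{\beta}}$ that minimize the sum of squared errors $(\mathbf{y} - \mathbf{Z}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{Z}\boldsymbol{\beta})$
satisfy the normal equations $(\mathbf{Z}'\mathbf{Z})\hat{\boldsymbol{\beta}} = \mathbf{Z}'\mathbf{y}$. Show that these equations are satisfied
for any $\hat{\boldsymbol{\beta}}$ such that $\mathbf{Z}\hat{\boldsymbol{\beta}}$ is the projection of $\mathbf{y}$ on the columns of $\mathbf{Z}$.

(c) Show that $\mathbf{Z}\hat{\boldsymbol{\beta}} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^-\mathbf{Z}'\mathbf{y}$ is the projection of $\mathbf{y}$ on the columns of $\mathbf{Z}$. (See Footnote 2
in this chapter.)

(d) Show directly that $\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^-\mathbf{Z}'\mathbf{y}$ is a solution to the normal equations
$(\mathbf{Z}'\mathbf{Z})[(\mathbf{Z}'\mathbf{Z})^-\mathbf{Z}'\mathbf{y}] = \mathbf{Z}'\mathbf{y}$.

Hint: (b) If $\mathbf{Z}\hat{\boldsymbol{\beta}}$ is the projection, then $\mathbf{y} - \mathbf{Z}\hat{\boldsymbol{\beta}}$ is perpendicular to the columns of $\mathbf{Z}$.

(d) The eigenvalue-eigenvector requirement implies that $(\mathbf{Z}'\mathbf{Z})(\lambda_i^{-1}\mathbf{e}_i) = \mathbf{e}_i$ for $i \le r_1 + 1$
and $0 = \mathbf{e}_i'(\mathbf{Z}'\mathbf{Z})\mathbf{e}_i$ for $i > r_1 + 1$. Therefore, $(\mathbf{Z}'\mathbf{Z})(\lambda_i^{-1}\mathbf{e}_i)\mathbf{e}_i'\mathbf{Z}' = \mathbf{e}_i\mathbf{e}_i'\mathbf{Z}'$. Summing
over $i$ gives

$$(\mathbf{Z}'\mathbf{Z})(\mathbf{Z}'\mathbf{Z})^-\mathbf{Z}' = \mathbf{Z}'\mathbf{Z}\left(\sum_{i=1}^{r_1+1}\lambda_i^{-1}\mathbf{e}_i\mathbf{e}_i'\right)\mathbf{Z}' = \left(\sum_{i=1}^{r_1+1}\mathbf{e}_i\mathbf{e}_i'\right)\mathbf{Z}' = \left(\sum_{i=1}^{r+1}\mathbf{e}_i\mathbf{e}_i'\right)\mathbf{Z}' = \mathbf{I}\mathbf{Z}' = \mathbf{Z}'$$

since $\mathbf{e}_i'\mathbf{Z}' = \mathbf{0}'$ for $i > r_1 + 1$.

7.7. Suppose the classical regression model is, with $\operatorname{rank}(\mathbf{Z}) = r + 1$, written as

$$\underset{(n\times 1)}{\mathbf{Y}} = \underset{(n\times(q+1))}{\mathbf{Z}_1}\;\underset{((q+1)\times 1)}{\boldsymbol{\beta}_{(1)}} + \underset{(n\times(r-q))}{\mathbf{Z}_2}\;\underset{((r-q)\times 1)}{\boldsymbol{\beta}_{(2)}} + \underset{(n\times 1)}{\boldsymbol{\varepsilon}}$$

where $\operatorname{rank}(\mathbf{Z}_1) = q + 1$ and $\operatorname{rank}(\mathbf{Z}_2) = r - q$. If the parameters $\boldsymbol{\beta}_{(2)}$ are identified
beforehand as being of primary interest, show that a $100(1 - \alpha)\%$ confidence region for
$\boldsymbol{\beta}_{(2)}$ is given by

$$(\hat{\boldsymbol{\beta}}_{(2)} - \boldsymbol{\beta}_{(2)})'\left[\mathbf{Z}_2'\mathbf{Z}_2 - \mathbf{Z}_2'\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{Z}_2\right](\hat{\boldsymbol{\beta}}_{(2)} - \boldsymbol{\beta}_{(2)}) \le s^2(r - q)F_{r-q,\,n-r-1}(\alpha)$$

Hint: By Exercise 4.12, with 1's and 2's interchanged,

$$\mathbf{C}^{22} = \left[\mathbf{Z}_2'\mathbf{Z}_2 - \mathbf{Z}_2'\mathbf{Z}_1(\mathbf{Z}_1'\mathbf{Z}_1)^{-1}\mathbf{Z}_1'\mathbf{Z}_2\right]^{-1}, \quad\text{where}\quad (\mathbf{Z}'\mathbf{Z})^{-1} = \begin{bmatrix}\mathbf{C}^{11} & \mathbf{C}^{12}\\ \mathbf{C}^{21} & \mathbf{C}^{22}\end{bmatrix}$$

Multiply by the square-root matrix $(\mathbf{C}^{22})^{-1/2}$, and conclude that $(\mathbf{C}^{22})^{-1/2}(\hat{\boldsymbol{\beta}}_{(2)} - \boldsymbol{\beta}_{(2)})/\sigma$
is $N(\mathbf{0}, \mathbf{I})$, so that

$$(\hat{\boldsymbol{\beta}}_{(2)} - \boldsymbol{\beta}_{(2)})'(\mathbf{C}^{22})^{-1}(\hat{\boldsymbol{\beta}}_{(2)} - \boldsymbol{\beta}_{(2)}) \text{ is } \sigma^2\chi^2_{r-q}.$$

7.8. Recall that the hat matrix is defined by $\mathbf{H} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$ with diagonal elements $h_{jj}$.

(a) Show that $\mathbf{H}$ is an idempotent matrix. [See Result 7.1 and (7-6).]

(b) Show that $0 < h_{jj} < 1$, $j = 1, 2, \ldots, n$, and that $\sum_{j=1}^{n} h_{jj} = r + 1$, where $r$ is the
number of independent variables in the regression model. (In fact, $(1/n) \le h_{jj} < 1$.)

(c) Verify, for the simple linear regression model with one independent variable $z$, that
the leverage, $h_{jj}$, is given by

$$h_{jj} = \frac{1}{n} + \frac{(z_j - \bar{z})^2}{\sum_{j=1}^{n}(z_j - \bar{z})^2}$$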
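As a numerical companion to Exercise 7.8 (not a substitute for the algebraic arguments it asks for), the following sketch builds a hat matrix for a simple linear regression and checks idempotence, the trace, and the closed-form leverage. The predictor values in `z` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical predictor values (an assumption for illustration only).
z = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
n = z.size

# Design matrix with an intercept column, so it has r + 1 = 2 columns.
Z = np.column_stack([np.ones(n), z])

# Hat matrix H = Z (Z'Z)^{-1} Z' and its diagonal (the leverages h_jj).
H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
leverages = np.diag(H)

# Closed form for one predictor: 1/n + (z_j - zbar)^2 / sum_i (z_i - zbar)^2.
dev = z - z.mean()
closed_form = 1.0 / n + dev**2 / np.sum(dev**2)

print("H idempotent:", np.allclose(H @ H, H))
print("trace(H):", np.trace(H))
print("leverages match closed form:", np.allclose(leverages, closed_form))
```

The point farthest from the mean of `z` receives the largest leverage, which is what the closed form suggests.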


7.9. Consider the following data on one predictor variable $z_1$ and two responses $Y_1$ and $Y_2$:

    z1 |  -2  -1   0   1   2
    y1 |   5   3   4   2   1
    y2 |  -3  -1  -1   2   3

Determine the least squares estimates of the parameters in the bivariate straight-line
regression model

$$Y_{j1} = \beta_{01} + \beta_{11}z_{j1} + \varepsilon_{j1}$$
$$Y_{j2} = \beta_{02} + \beta_{12}z_{j1} + \varepsilon_{j2}, \qquad j = 1, 2, 3, 4, 5$$

Also, calculate the matrices of fitted values $\hat{\mathbf{Y}}$ and residuals $\hat{\boldsymbol{\varepsilon}}$ with $\mathbf{Y} = [\mathbf{y}_1 \mid \mathbf{y}_2]$.
Verify the sum of squares and cross-products decomposition

$$\mathbf{Y}'\mathbf{Y} = \hat{\mathbf{Y}}'\hat{\mathbf{Y}} + \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$$

7.10. Using the results from Exercise 7.9, calculate each of the following.
(a) A 95% confidence interval for the mean response $E(Y_{01}) = \beta_{01} + \beta_{11}z_{01}$
corresponding to $z_{01} = 0.5$
(b) A 95% prediction interval for the response $Y_{01}$ corresponding to $z_{01} = 0.5$
(c) A 95% prediction region for the responses $Y_{01}$ and $Y_{02}$ corresponding to $z_{01} = 0.5$


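7.11. (Generalized least squares for multivariate multiple regression.) Let $\mathbf{A}$ be a positive
definite matrix, so that $d_j^2(\mathbf{B}) = (\mathbf{y}_j - \mathbf{B}'\mathbf{z}_j)'\mathbf{A}(\mathbf{y}_j - \mathbf{B}'\mathbf{z}_j)$ is a squared statistical
distance from the $j$th observation $\mathbf{y}_j$ to its regression $\mathbf{B}'\mathbf{z}_j$. Show that the choice
$\mathbf{B} = \hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y}$ minimizes the sum of squared statistical distances, $\sum_{j=1}^{n} d_j^2(\mathbf{B})$,
for any choice of positive definite $\mathbf{A}$. Choices for $\mathbf{A}$ include $\boldsymbol{\Sigma}^{-1}$ and $\mathbf{I}$.
Hint: Repeat the steps in the proof of Result 7.10 with $\boldsymbol{\Sigma}^{-1}$ replaced by $\mathbf{A}$.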


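7.12. Given the mean vector and covariance matrix of $Y$, $Z_1$, and $Z_2$,

$$\boldsymbol{\mu} = \begin{bmatrix}\mu_Y\\ \boldsymbol{\mu}_Z\end{bmatrix} = \begin{bmatrix}\ldots\\ 3\\ -2\end{bmatrix} \quad\text{and}\quad \boldsymbol{\Sigma} = \begin{bmatrix}\sigma_{YY} & \boldsymbol{\sigma}_{ZY}'\\ \boldsymbol{\sigma}_{ZY} & \boldsymbol{\Sigma}_{ZZ}\end{bmatrix} = [\ldots]$$

determine each of the following.

(a) The best linear predictor $\beta_0 + \beta_1 Z_1 + \beta_2 Z_2$ of $Y$
(b) The mean square error of the best linear predictor
(c) The population multiple correlation coefficient
(d) The partial correlation coefficient $\rho_{YZ_1\cdot Z_2}$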




7.13. The test scores for college students described in Example 5.5 have

$$\bar{\mathbf{z}} = \begin{bmatrix}\bar{z}_1\\ \bar{z}_2\\ \bar{z}_3\end{bmatrix} = \begin{bmatrix}527.74\\ 54.69\\ 25.13\end{bmatrix}, \qquad \mathbf{S} = \begin{bmatrix}5691.34 & & \\ 600.51 & 126.05 & \\ 217.25 & 23.37 & 23.14\end{bmatrix}$$

Assume joint normality.

(a) Obtain the maximum likelihood estimates of the parameters for predicting $Z_1$ from
$Z_2$ and $Z_3$.

(b) Evaluate the estimated multiple correlation coefficient $R_{Z_1(Z_2,Z_3)}$.

(c) Determine the estimated partial correlation coefficient $R_{Z_1,Z_2\cdot Z_3}$.

7.14. Twenty-five portfolio managers were evaluated in terms of their performance. Suppose
$Y$ represents the rate of return achieved over a period of time, $Z_1$ is the manager's
attitude toward risk measured on a five-point scale from "very conservative" to "very risky,"
and $Z_2$ is years of experience in the investment business. The observed correlation
coefficients between pairs of variables are

$$\mathbf{R} = \begin{bmatrix}1.0 & -.35 & .82\\ -.35 & 1.0 & -.60\\ .82 & -.60 & 1.0\end{bmatrix}$$

with rows and columns ordered $Y$, $Z_1$, $Z_2$.

(a) Interpret the sample correlation coefficients $r_{YZ_1} = -.35$ and $r_{YZ_2} = -.82$.

(b) Calculate the partial correlation coefficient $r_{YZ_1\cdot Z_2}$ and interpret this quantity with
respect to the interpretation provided for $r_{YZ_1}$ in Part a.

The following exercises may require the use of a computer.

7.15. Use the real-estate data in Table 7.1 and the linear regression model in Example 7.4.

(a) Verify the results in Example 7.4.

(b) Analyze the residuals to check the adequacy of the model. (See Section 7.6.)

(c) Generate a 95% prediction interval for the selling price ($Y_0$) corresponding to total
dwelling size $z_1 = 17$ and assessed value $z_2 = 46$.

(d) Carry out a likelihood ratio test of $H_0\colon \beta_2 = 0$ with a significance level of $\alpha = .05$.
Should the original model be modified? Discuss.

7.16. Calculate a $C_p$ plot corresponding to the possible linear regressions involving the
real-estate data in Table 7.1.

7.17. Consider the Forbes data in Exercise 1.4.

(a) Fit a linear regression model to these data using profits as the dependent variable
and sales and assets as the independent variables.

(b) Analyze the residuals to check the adequacy of the model. Compute the leverages
associated with the data points. Does one (or more) of these companies stand out as
an outlier in the set of independent variable data points?

(c) Generate a 95% prediction interval for profits corresponding to sales of 100 (billions
of dollars) and assets of 500 (billions of dollars).

(d) Carry out a likelihood ratio test of $H_0\colon \beta_2 = 0$ with a significance level of $\alpha = .05$.
Should the original model be modified? Discuss.




7.18. Calculate

(a) a $C_p$ plot corresponding to the possible regressions involving the Forbes data in
Exercise 1.4.

(b) the AIC for each possible regression.

7.19. Satellite applications motivated the development of a silver-zinc battery. Table 7.5
contains failure data collected to characterize the performance of the battery during its
life cycle. Use these data.

(a) Find the estimated linear regression of $\ln(Y)$ on an appropriate ("best") subset of
predictor variables.

(b) Plot the residuals from the fitted model chosen in Part a to check the normal
assumption.


Table 7.5 Battery-Failure Data

Columns: Z1 = charge rate (amps); Z2 = discharge rate (amps); Z3 = depth of discharge
(% of rated ampere-hours); Z4 = temperature (°C); Z5 = end of charge voltage (volts);
Y = cycles to failure.

     Z1      Z2      Z3     Z4     Z5      Y
    .375    3.13    60.0    40    2.00    101
   1.000    3.13    76.8    30    1.99    141
   1.000    3.13    60.0    20    2.00     96
   1.000    3.13    60.0    20    1.98    125
   1.625    3.13    43.2    10    2.01     43
   1.625    3.13    60.0    20    2.00     16
   1.625    3.13    60.0    20    2.02    188
    .375    5.00    76.8    10    2.01     10
   1.000    5.00    43.2    10    1.99      3
   1.000    5.00    43.2    30    2.01    386
   1.000    5.00   100.0    20    2.00     45
   1.625    5.00    76.8    10    1.99      2
    .375    1.25    76.8    10    2.01     76
   1.000    1.25    43.2    10    1.99     78
   1.000    1.25    76.8    30    2.00    160
   1.000    1.25    60.0     0    2.00      3
   1.625    1.25    43.2    30    1.99    216
   1.625    1.25    60.0    20    2.00     73
    .375    3.13    76.8    30    1.99    314
    .375    3.13    60.0    20    2.00    170

Source: Selected from S. Sidik, H. Leibecki, and J. Bozek, Failure of Silver-Zinc Cells with Competing
Failure Modes - Preliminary Data Analysis, NASA Technical Memorandum 81556 (Cleveland: Lewis Research
Center, 1980).


7.20. Using the battery-failure data in Table 7.5, regress $\ln(Y)$ on the first principal
component of the predictor variables $z_1, z_2, \ldots, z_5$. (See Section 8.3.) Compare the result with
the fitted model obtained in Exercise 7.19(a).


7.21. Consider the air-pollution data in Table 1.5. Let $Y_1 = \mathrm{NO}_2$ and $Y_2 = \mathrm{O}_3$ be the two
responses (pollutants) corresponding to the predictor variables $Z_1$ = wind and
$Z_2$ = solar radiation.

(a) Perform a regression analysis using only the first response $Y_1$.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(iii) Construct a 95% prediction interval for $\mathrm{NO}_2$ corresponding to $z_1 = 10$ and
$z_2 = 80$.

(b) Perform a multivariate multiple regression analysis using both responses $Y_1$ and $Y_2$.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(iii) Construct a 95% prediction ellipse for both $\mathrm{NO}_2$ and $\mathrm{O}_3$ for $z_1 = 10$ and $z_2 = 80$.
Compare this ellipse with the prediction interval in Part a (iii). Comment.

7.22. Using the data on bone mineral content in Table 1.8:

(a) Perform a regression analysis by fitting the response for the dominant radius bone to
the measurements on the last four bones.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.

(b) Perform a multivariate multiple regression analysis by fitting the responses from
both radius bones.

(c) Calculate the AIC for the model you chose in (b) and for the full model.

7.23. Using the data on the characteristics of bulls sold at auction in Table 1.10:

(a) Perform a regression analysis using the response $Y_1$ = SalePr and the predictor
variables Breed, YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt.
(i) Determine the "best" regression equation by retaining only those predictor
variables that are individually significant.
(ii) Using the best fitting model, construct a 95% prediction interval for selling
price for the set of predictor variable values (in the order listed above) 5, 48.7,
990, 74.0, 7, .18, 54.2 and 1450.
(iii) Examine the residuals from the best fitting model.

(b) Repeat the analysis in Part a, using the natural logarithm of the sales price as the
response. That is, set $Y_1$ = Ln(SalePr). Which analysis do you prefer? Why?

7.24. Using the data on the characteristics of bulls sold at auction in Table 1.10:

(a) Perform a regression analysis, using only the response $Y_1$ = SaleHt and the
predictor variables $Z_1$ = YrHgt and $Z_2$ = FtFrBody.
(i) Fit an appropriate model and analyze the residuals.
(ii) Construct a 95% prediction interval for SaleHt corresponding to $z_1 = 50.5$ and
$z_2 = 970$.

(b) Perform a multivariate regression analysis with the responses $Y_1$ = SaleHt and
$Y_2$ = SaleWt and the predictors $Z_1$ = YrHgt and $Z_2$ = FtFrBody.
(i) Fit an appropriate multivariate model and analyze the residuals.
(ii) Construct a 95% prediction ellipse for both SaleHt and SaleWt for $z_1 = 50.5$
and $z_2 = 970$. Compare this ellipse with the prediction interval in Part a (ii).
Comment.




7.25. Amitriptyline is prescribed by some physicians as an antidepressant. However, there
are also conjectured side effects that seem to be related to the use of the drug: irregular
heartbeat, abnormal blood pressures, and irregular waves on the electrocardiogram,
among other things. Data gathered on 17 patients who were admitted to the hospital
after an amitriptyline overdose are given in Table 7.6. The two response variables
are

$Y_1$ = Total TCAD plasma level (TOT)
$Y_2$ = Amount of amitriptyline present in TCAD plasma level (AMI)

The five predictor variables are

$Z_1$ = Gender: 1 if female, 0 if male (GEN)
$Z_2$ = Amount of antidepressants taken at time of overdose (AMT)
$Z_3$ = PR wave measurement (PR)
$Z_4$ = Diastolic blood pressure (DIAP)
$Z_5$ = QRS wave measurement (QRS)


Table 7.6 Amitriptyline Data 


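[Table body not recovered. Only a single 0/1 column survives, presumably $z_1$ (GEN), in
patient order: 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1.]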


Source: See [24]. 


(a) Perform a regression analysis using only the first response $Y_1$.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(iii) Construct a 95% prediction interval for Total TCAD for $z_1 = 1$, $z_2 = 1200$,
$z_3 = 140$, $z_4 = 70$, and $z_5 = 85$.
(b) Repeat Part a using the second response $Y_2$.
(c) Perform a multivariate multiple regression analysis using both responses $Y_1$ and $Y_2$.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(iii) Construct a 95% prediction ellipse for both Total TCAD and Amount of
amitriptyline for $z_1 = 1$, $z_2 = 1200$, $z_3 = 140$, $z_4 = 70$, and $z_5 = 85$. Compare
this ellipse with the prediction intervals in Parts a and b. Comment.


7.26. Measurements of properties of pulp fibers and the paper made from them are contained
in Table 7.7 (see also [19] and website: www.prenhall.com/statistics). There are $n = 62$
observations of the pulp fiber characteristics, $z_1$ = arithmetic fiber length, $z_2$ = long
fiber fraction, $z_3$ = fine fiber fraction, $z_4$ = zero span tensile, and the paper properties,
$y_1$ = breaking length, $y_2$ = elastic modulus, $y_3$ = stress at failure, $y_4$ = burst strength.


Table 7.7 Pulp and Paper Properties Data


Source: See Lee [19]. 


(a) Perform a regression analysis using each of the response variables $Y_1$, $Y_2$, $Y_3$ and $Y_4$.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals. Check for outliers or observations with high leverage.
(iii) Construct a 95% prediction interval for SF ($Y_3$) for $z_1 = .330$, $z_2 = 45.500$,
$z_3 = 20.375$, $z_4 = 1.010$.
(b) Perform a multivariate multiple regression analysis using all four response variables,
$Y_1$, $Y_2$, $Y_3$ and $Y_4$, and the four independent variables, $Z_1$, $Z_2$, $Z_3$ and $Z_4$.
(i) Suggest and fit an appropriate linear regression model. Specify the matrix of
estimated coefficients $\hat{\boldsymbol{\beta}}$ and estimated error covariance matrix $\hat{\boldsymbol{\Sigma}}$.
(ii) Analyze the residuals. Check for outliers.
(iii) Construct simultaneous 95% prediction intervals for the individual responses
$Y_{0i}$, $i = 1, 2, 3, 4$, for the same settings of the independent variables given in part
a (iii) above. Compare the simultaneous prediction interval for $Y_{03}$ with the
prediction interval in part a (iii). Comment.


7.27. Refer to the data on fixing breakdowns in cell phone relay towers in Table 6.20. In the
initial design, experience level was coded as Novice or Guru. Now consider three levels
of experience: Novice, Guru and Experienced. Some additional runs for an experienced
engineer are given below. Also, in the original data set, reclassify Guru in run 3 as
Experienced and Novice in run 14 as Experienced. Keep all the other numbers for these
two engineers the same. With these changes and the new data below, perform a
multivariate multiple regression analysis with assessment and implementation times as the
responses, and problem severity, problem complexity and experience level as the predictor
variables. Consider regression models with the predictor variables and two factor
interaction terms as inputs. (Note: The two changes in the original data set and the additional
data below unbalance the design, so the analysis is best handled with regression
methods.)

  Problem    Problem      Engineer      Problem      Problem          Total
  severity   complexity   experience    assessment   implementation   resolution
  level      level        level         time         time             time

  Low        Complex      Experienced    5.3           9.2            14.5
  Low        Complex      Experienced    5.0          10.9            15.9
  High       Simple       Experienced    4.0           8.6            12.6
  High       Simple       Experienced    4.5           8.7            13.2
  High       Complex      Experienced    6.9          14.9            21.8

References

1. Abraham, B. and J. Ledolter. Introduction to Regression Modeling. Belmont, CA:
Thompson Brooks/Cole, 2006.

2. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York:
John Wiley, 2003.

3. Atkinson, A. C. Plots, Transformations and Regression: An Introduction to Graphical
Methods of Diagnostic Regression Analysis. Oxford, England: Oxford University Press,
1986.

4. Bartlett, M. S. “A Note on Multiplying Factors for Various Chi-Squared Approximations.”
Journal of the Royal Statistical Society (B), 16 (1954), 296-298.

5. Belsley, D. A., E. Kuh, and R. E. Welsh. Regression Diagnostics: Identifying Influential
Data and Sources of Collinearity (Paperback). New York: Wiley-Interscience, 2004.

6. Bowerman, B. L., and R. T. O’Connell. Linear Statistical Models: An Applied Approach
(2nd ed.). Belmont, CA: Thompson Brooks/Cole, 2000.

7. Box, G. E. P. “A General Distribution Theory for a Class of Likelihood Criteria.”
Biometrika, 36 (1949), 317-346.

8. Box, G. E. P., G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and
Control (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, 1994.

9. Chatterjee, S., A. S. Hadi, and B. Price. Regression Analysis by Example (4th ed.). New
York: Wiley-Interscience, 2006.

10. Cook, R. D., and S. Weisberg. Applied Regression Including Computing and Graphics.
New York: John Wiley, 1999.

11. Cook, R. D., and S. Weisberg. Residuals and Influence in Regression. London: Chapman
and Hall, 1982.

12. Daniel, C. and F. S. Wood. Fitting Equations to Data (2nd ed.) (paperback). New York:
Wiley-Interscience, 1999.

13. Draper, N. R., and H. Smith. Applied Regression Analysis (3rd ed.). New York: John
Wiley, 1998.

14. Durbin, J., and G. S. Watson. “Testing for Serial Correlation in Least Squares Regression,
II.” Biometrika, 38 (1951), 159-178.

15. Galton, F. “Regression Toward Mediocrity in Heredity Stature.” Journal of the
Anthropological Institute, 15 (1885), 246-263.

16. Goldberger, A. S. Econometric Theory. New York: John Wiley, 1964.

17. Heck, D. L. “Charts of Some Upper Percentage Points of the Distribution of the Largest
Characteristic Root.” Annals of Mathematical Statistics, 31 (1960), 625-642.

18. Khattree, R. and D. N. Naik. Applied Multivariate Statistics with SAS® Software (2nd ed.).
Cary, NC: SAS Institute Inc., 1999.

19. Lee, J. “Relationships Between Properties of Pulp-Fibre and Paper.” Unpublished
doctoral thesis, University of Toronto, Faculty of Forestry, 1992.

20. Neter, J., W. Wasserman, M. Kutner, and C. Nachtsheim. Applied Linear Regression
Models (3rd ed.). Chicago: Richard D. Irwin, 1996.

21. Pillai, K. C. S. “Upper Percentage Points of the Largest Root of a Matrix in Multivariate
Analysis.” Biometrika, 54 (1967), 189-193.

22. Rao, C. R. Linear Statistical Inference and Its Applications (2nd ed.) (paperback). New
York: Wiley-Interscience, 2002.

23. Seber, G. A. F. Linear Regression Analysis. New York: John Wiley, 1977.

24. Rudorfer, M. V. “Cardiovascular Changes and Plasma Drug Levels after Amitriptyline
Overdose.” Journal of Toxicology-Clinical Toxicology, 19 (1982), 67-71.


Chapter 8


PRINCIPAL COMPONENTS 


8.1 Introduction


A principal component analysis is concerned with explaining the variance-covariance
structure of a set of variables through a few linear combinations of these variables. Its
general objectives are (1) data reduction and (2) interpretation.

Although p components are required to reproduce the total system variability,
often much of this variability can be accounted for by a small number k of the
principal components. If so, there is (almost) as much information in the k components
as there is in the original p variables. The k principal components can then replace
the initial p variables, and the original data set, consisting of n measurements on
p variables, is reduced to a data set consisting of n measurements on k principal
components.

An analysis of principal components often reveals relationships that were not 
previously suspected and thereby allows interpretations that would not ordinarily 
result. A good example of this is provided by the stock market data discussed in 
Example 8.5. 

Analyses of principal components are more of a means to an end rather than an
end in themselves, because they frequently serve as intermediate steps in much
larger investigations. For example, principal components may be inputs to a multiple
regression (see Chapter 7) or cluster analysis (see Chapter 12). Moreover, (scaled)
principal components are one “factoring” of the covariance matrix for the factor
analysis model considered in Chapter 9.


8.2 Population Principal Components 
Algebraically, principal components are particular linear combinations of the p random
variables $X_1, X_2, \ldots, X_p$. Geometrically, these linear combinations represent
the selection of a new coordinate system obtained by rotating the original system
with $X_1, X_2, \ldots, X_p$ as the coordinate axes. The new axes represent the directions
with maximum variability and provide a simpler and more parsimonious description
of the covariance structure.

As we shall see, principal components depend solely on the covariance
matrix $\boldsymbol{\Sigma}$ (or the correlation matrix $\boldsymbol{\rho}$) of $X_1, X_2, \ldots, X_p$. Their development does
not require a multivariate normal assumption. On the other hand, principal
components derived for multivariate normal populations have useful interpretations
in terms of the constant density ellipsoids. Further, inferences can be made
from the sample components when the population is multivariate normal. (See
Section 8.5.)

Let the random vector $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$ have the covariance matrix $\boldsymbol{\Sigma}$
with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.

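Consider the linear combinations

$$\begin{aligned}
Y_1 &= \mathbf{a}_1'\mathbf{X} = a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p\\
Y_2 &= \mathbf{a}_2'\mathbf{X} = a_{21}X_1 + a_{22}X_2 + \cdots + a_{2p}X_p\\
&\;\;\vdots\\
Y_p &= \mathbf{a}_p'\mathbf{X} = a_{p1}X_1 + a_{p2}X_2 + \cdots + a_{pp}X_p
\end{aligned} \tag{8-1}$$

Then, using (2-45), we obtain

$$\operatorname{Var}(Y_i) = \mathbf{a}_i'\boldsymbol{\Sigma}\mathbf{a}_i \qquad i = 1, 2, \ldots, p \tag{8-2}$$
$$\operatorname{Cov}(Y_i, Y_k) = \mathbf{a}_i'\boldsymbol{\Sigma}\mathbf{a}_k \qquad i, k = 1, 2, \ldots, p \tag{8-3}$$

The principal components are those uncorrelated linear combinations $Y_1, Y_2, \ldots, Y_p$
whose variances in (8-2) are as large as possible.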

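The first principal component is the linear combination with maximum
variance. That is, it maximizes $\operatorname{Var}(Y_1) = \mathbf{a}_1'\boldsymbol{\Sigma}\mathbf{a}_1$. It is clear that $\operatorname{Var}(Y_1) = \mathbf{a}_1'\boldsymbol{\Sigma}\mathbf{a}_1$ can
be increased by multiplying any $\mathbf{a}_1$ by some constant. To eliminate this indeterminacy,
it is convenient to restrict attention to coefficient vectors of unit length. We
therefore define

First principal component = linear combination $\mathbf{a}_1'\mathbf{X}$ that maximizes
                            $\operatorname{Var}(\mathbf{a}_1'\mathbf{X})$ subject to $\mathbf{a}_1'\mathbf{a}_1 = 1$
Second principal component = linear combination $\mathbf{a}_2'\mathbf{X}$ that maximizes
                             $\operatorname{Var}(\mathbf{a}_2'\mathbf{X})$ subject to $\mathbf{a}_2'\mathbf{a}_2 = 1$ and
                             $\operatorname{Cov}(\mathbf{a}_1'\mathbf{X}, \mathbf{a}_2'\mathbf{X}) = 0$

At the ith step,

ith principal component = linear combination $\mathbf{a}_i'\mathbf{X}$ that maximizes
                          $\operatorname{Var}(\mathbf{a}_i'\mathbf{X})$ subject to $\mathbf{a}_i'\mathbf{a}_i = 1$ and
                          $\operatorname{Cov}(\mathbf{a}_i'\mathbf{X}, \mathbf{a}_k'\mathbf{X}) = 0$ for $k < i$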
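The maximization in this definition can be explored numerically. The sketch below is only an illustration under stated assumptions: the covariance matrix `sigma` is hypothetical, and the random search over unit vectors is a crude check, not a method the text prescribes. It compares $\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$ over many random unit-length vectors with the largest eigenvalue, and shows why the unit-length restriction is needed.

```python
import numpy as np

# Hypothetical 3x3 covariance matrix (an assumption for illustration only).
sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 1.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(sigma)
lam1 = eigvals[-1]      # largest eigenvalue
e1 = eigvecs[:, -1]     # corresponding unit-length eigenvector

# Draw random directions and scale each row to unit length.
rng = np.random.default_rng(0)
a = rng.normal(size=(10000, 3))
a /= np.linalg.norm(a, axis=1, keepdims=True)

# Var(a'X) = a' Sigma a for every row of a.
variances = np.einsum("ij,jk,ik->i", a, sigma, a)
max_random = variances.max()

# Variance attained by the first eigenvector, and by a rescaled copy of it.
var_e1 = e1 @ sigma @ e1
var_scaled = (2.0 * e1) @ sigma @ (2.0 * e1)

print("largest eigenvalue:", lam1)
print("best random unit vector:", max_random)
print("variance along e1:", var_e1)
print("variance along 2*e1:", var_scaled)
```

No random unit vector exceeds the largest eigenvalue, while doubling the coefficient vector quadruples the variance, which is the indeterminacy the unit-length constraint removes.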




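Result 8.1. Let $\boldsymbol{\Sigma}$ be the covariance matrix associated with the random vector
$\mathbf{X}' = [X_1, X_2, \ldots, X_p]$. Let $\boldsymbol{\Sigma}$ have the eigenvalue-eigenvector pairs $(\lambda_1, \mathbf{e}_1)$,
$(\lambda_2, \mathbf{e}_2), \ldots, (\lambda_p, \mathbf{e}_p)$ where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Then the ith principal
component is given by

$$Y_i = \mathbf{e}_i'\mathbf{X} = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p, \qquad i = 1, 2, \ldots, p \tag{8-4}$$

With these choices,

$$\begin{aligned}
\operatorname{Var}(Y_i) &= \mathbf{e}_i'\boldsymbol{\Sigma}\mathbf{e}_i = \lambda_i \qquad i = 1, 2, \ldots, p\\
\operatorname{Cov}(Y_i, Y_k) &= \mathbf{e}_i'\boldsymbol{\Sigma}\mathbf{e}_k = 0 \qquad i \ne k
\end{aligned} \tag{8-5}$$

If some $\lambda_i$ are equal, the choices of the corresponding coefficient vectors, $\mathbf{e}_i$, and
hence $Y_i$, are not unique.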


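Proof. We know from (2-51), with $\mathbf{B} = \boldsymbol{\Sigma}$, that

$$\max_{\mathbf{a} \ne \mathbf{0}} \frac{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}}{\mathbf{a}'\mathbf{a}} = \lambda_1 \qquad (\text{attained when } \mathbf{a} = \mathbf{e}_1)$$

But $\mathbf{e}_1'\mathbf{e}_1 = 1$ since the eigenvectors are normalized. Thus,

$$\max_{\mathbf{a} \ne \mathbf{0}} \frac{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}}{\mathbf{a}'\mathbf{a}} = \lambda_1 = \frac{\mathbf{e}_1'\boldsymbol{\Sigma}\mathbf{e}_1}{\mathbf{e}_1'\mathbf{e}_1} = \mathbf{e}_1'\boldsymbol{\Sigma}\mathbf{e}_1 = \operatorname{Var}(Y_1)$$

Similarly, using (2-52), we get

$$\max_{\mathbf{a} \perp \mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_k} \frac{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}}{\mathbf{a}'\mathbf{a}} = \lambda_{k+1} \qquad k = 1, 2, \ldots, p - 1$$

For the choice $\mathbf{a} = \mathbf{e}_{k+1}$, with $\mathbf{e}_{k+1}'\mathbf{e}_i = 0$, for $i = 1, 2, \ldots, k$ and $k = 1, 2, \ldots, p - 1$,

$$\frac{\mathbf{e}_{k+1}'\boldsymbol{\Sigma}\mathbf{e}_{k+1}}{\mathbf{e}_{k+1}'\mathbf{e}_{k+1}} = \mathbf{e}_{k+1}'\boldsymbol{\Sigma}\mathbf{e}_{k+1} = \operatorname{Var}(Y_{k+1})$$

But $\mathbf{e}_{k+1}'(\boldsymbol{\Sigma}\mathbf{e}_{k+1}) = \lambda_{k+1}\mathbf{e}_{k+1}'\mathbf{e}_{k+1} = \lambda_{k+1}$, so $\operatorname{Var}(Y_{k+1}) = \lambda_{k+1}$. It remains to show
that $\mathbf{e}_i$ perpendicular to $\mathbf{e}_k$ (that is, $\mathbf{e}_i'\mathbf{e}_k = 0$, $i \ne k$) gives $\operatorname{Cov}(Y_i, Y_k) = 0$. Now, the
eigenvectors of $\boldsymbol{\Sigma}$ are orthogonal if all the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ are distinct. If
the eigenvalues are not all distinct, the eigenvectors corresponding to common
eigenvalues may be chosen to be orthogonal. Therefore, for any two eigenvectors $\mathbf{e}_i$
and $\mathbf{e}_k$, $\mathbf{e}_i'\mathbf{e}_k = 0$, $i \ne k$. Since $\boldsymbol{\Sigma}\mathbf{e}_k = \lambda_k\mathbf{e}_k$, premultiplication by $\mathbf{e}_i'$ gives

$$\operatorname{Cov}(Y_i, Y_k) = \mathbf{e}_i'\boldsymbol{\Sigma}\mathbf{e}_k = \mathbf{e}_i'\lambda_k\mathbf{e}_k = \lambda_k\mathbf{e}_i'\mathbf{e}_k = 0$$

for any $i \ne k$, and the proof is complete. ■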


From Result 8.1, the principal components are uncorrelated and have variances
equal to the eigenvalues of $\boldsymbol{\Sigma}$.
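A quick numerical check of this statement is sketched below. The matrix `sigma` is a hypothetical covariance matrix (an assumption, not taken from the text); the matrix `P` holds the eigenvectors as columns, so the covariance matrix of $\mathbf{Y} = \mathbf{P}'\mathbf{X}$ is $\mathbf{P}'\boldsymbol{\Sigma}\mathbf{P}$.

```python
import numpy as np

# Hypothetical covariance matrix (an assumption for illustration only).
sigma = np.array([[5.0, 2.0, 1.0],
                  [2.0, 4.0, 0.5],
                  [1.0, 0.5, 3.0]])

# Eigen-decomposition, reordered so that lambda_1 >= lambda_2 >= lambda_3.
eigvals, P = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], P[:, order]

# Covariance matrix of the principal components Y = P'X.
cov_y = P.T @ sigma @ P
off_diag = cov_y - np.diag(np.diag(cov_y))

print("Var(Y_i):", np.diag(cov_y))
print("eigenvalues:", eigvals)
print("largest |Cov(Y_i, Y_k)|, i != k:", np.abs(off_diag).max())
```

The diagonal of `cov_y` reproduces the eigenvalues and the off-diagonal entries vanish up to rounding error, in line with (8-5).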


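Result 8.2. Let $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$ have covariance matrix $\boldsymbol{\Sigma}$, with eigenvalue-
eigenvector pairs $(\lambda_1, \mathbf{e}_1), (\lambda_2, \mathbf{e}_2), \ldots, (\lambda_p, \mathbf{e}_p)$ where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.
Let $Y_1 = \mathbf{e}_1'\mathbf{X}$, $Y_2 = \mathbf{e}_2'\mathbf{X}, \ldots, Y_p = \mathbf{e}_p'\mathbf{X}$ be the principal components. Then

$$\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \sum_{i=1}^{p}\operatorname{Var}(X_i) = \lambda_1 + \lambda_2 + \cdots + \lambda_p = \sum_{i=1}^{p}\operatorname{Var}(Y_i)$$

Proof. From Definition 2A.28, $\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \operatorname{tr}(\boldsymbol{\Sigma})$. From (2-20) with
$\mathbf{A} = \boldsymbol{\Sigma}$, we can write $\boldsymbol{\Sigma} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'$ where $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues and
$\mathbf{P} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p]$ so that $\mathbf{P}\mathbf{P}' = \mathbf{P}'\mathbf{P} = \mathbf{I}$. Using Result 2A.12(c), we have

$$\operatorname{tr}(\boldsymbol{\Sigma}) = \operatorname{tr}(\mathbf{P}\boldsymbol{\Lambda}\mathbf{P}') = \operatorname{tr}(\boldsymbol{\Lambda}\mathbf{P}'\mathbf{P}) = \operatorname{tr}(\boldsymbol{\Lambda}) = \lambda_1 + \lambda_2 + \cdots + \lambda_p$$

Thus,

$$\sum_{i=1}^{p}\operatorname{Var}(X_i) = \operatorname{tr}(\boldsymbol{\Sigma}) = \operatorname{tr}(\boldsymbol{\Lambda}) = \sum_{i=1}^{p}\operatorname{Var}(Y_i) \qquad ■$$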


Result 8.2 says that

$$\text{Total population variance} = \sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \lambda_1 + \lambda_2 + \cdots + \lambda_p \tag{8-6}$$

and consequently, the proportion of total variance due to (explained by) the kth
principal component is

$$\begin{pmatrix}\text{Proportion of total}\\ \text{population variance}\\ \text{due to } k\text{th principal}\\ \text{component}\end{pmatrix} = \frac{\lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p} \qquad k = 1, 2, \ldots, p \tag{8-7}$$

If most (for instance, 80 to 90%) of the total population variance, for large p, can be
attributed to the first one, two, or three components, then these components can
“replace” the original p variables without much loss of information.
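The bookkeeping in (8-6) and (8-7) is easy to carry out numerically. The following sketch assumes a hypothetical 4 x 4 covariance matrix `sigma` and an illustrative 90% threshold; neither comes from the text. It checks that the trace equals the sum of the eigenvalues and finds the smallest number of components whose cumulative proportion reaches the threshold.

```python
import numpy as np

# Hypothetical covariance matrix (an assumption for illustration only).
sigma = np.array([[10.0, 4.0, 2.0, 1.0],
                  [4.0,  6.0, 1.0, 0.5],
                  [2.0,  1.0, 4.0, 0.3],
                  [1.0,  0.5, 0.3, 2.0]])

# Eigenvalues in descending order: lambda_1 >= ... >= lambda_p.
eigvals = np.linalg.eigvalsh(sigma)[::-1]

total_variance = np.trace(sigma)           # sigma_11 + ... + sigma_pp
proportions = eigvals / eigvals.sum()      # (8-7) for k = 1, ..., p
cumulative = np.cumsum(proportions)

# Smallest k whose cumulative proportion reaches the (illustrative) threshold.
threshold = 0.90
k = int(np.searchsorted(cumulative, threshold) + 1)

print("total variance:", total_variance)
print("sum of eigenvalues:", eigvals.sum())
print("proportions:", proportions)
print("components needed for 90%:", k)
```

The threshold itself is a judgment call; the text mentions 80 to 90% only as an instance.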

Each component of the coefficient vector $\mathbf{e}_i' = [e_{i1}, \ldots, e_{ik}, \ldots, e_{ip}]$ also merits
inspection. The magnitude of $e_{ik}$ measures the importance of the kth variable to the
ith principal component, irrespective of the other variables. In particular, $e_{ik}$ is
proportional to the correlation coefficient between $Y_i$ and $X_k$.


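Result 8.3. If $Y_1 = \mathbf{e}_1'\mathbf{X}$, $Y_2 = \mathbf{e}_2'\mathbf{X}, \ldots, Y_p = \mathbf{e}_p'\mathbf{X}$ are the principal components
obtained from the covariance matrix $\boldsymbol{\Sigma}$, then

$$\rho_{Y_i, X_k} = \frac{e_{ik}\sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}} \qquad i, k = 1, 2, \ldots, p \tag{8-8}$$

are the correlation coefficients between the components $Y_i$ and the variables $X_k$.
Here $(\lambda_1, \mathbf{e}_1), (\lambda_2, \mathbf{e}_2), \ldots, (\lambda_p, \mathbf{e}_p)$ are the eigenvalue-eigenvector pairs for $\boldsymbol{\Sigma}$.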


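Proof. Set $\mathbf{a}_k' = [0, \ldots, 0, 1, 0, \ldots, 0]$ so that $X_k = \mathbf{a}_k'\mathbf{X}$ and $\operatorname{Cov}(X_k, Y_i) =
\operatorname{Cov}(\mathbf{a}_k'\mathbf{X}, \mathbf{e}_i'\mathbf{X}) = \mathbf{a}_k'\boldsymbol{\Sigma}\mathbf{e}_i$, according to (2-45). Since $\boldsymbol{\Sigma}\mathbf{e}_i = \lambda_i\mathbf{e}_i$, $\operatorname{Cov}(X_k, Y_i) = \mathbf{a}_k'\lambda_i\mathbf{e}_i =
\lambda_i e_{ik}$. Then $\operatorname{Var}(Y_i) = \lambda_i$ [see (8-5)] and $\operatorname{Var}(X_k) = \sigma_{kk}$ yield

$$\rho_{Y_i, X_k} = \frac{\operatorname{Cov}(Y_i, X_k)}{\sqrt{\operatorname{Var}(Y_i)}\sqrt{\operatorname{Var}(X_k)}} = \frac{\lambda_i e_{ik}}{\sqrt{\lambda_i}\sqrt{\sigma_{kk}}} = \frac{e_{ik}\sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}} \qquad i, k = 1, 2, \ldots, p \qquad ■$$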
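The argument in this proof can be mirrored with matrix algebra. In the sketch below, `sigma` is a hypothetical covariance matrix (an assumption for illustration). The correlations are computed once directly from $\operatorname{Cov}(\mathbf{Y}, \mathbf{X}) = \mathbf{P}'\boldsymbol{\Sigma}$, where $\mathbf{P}$ has the eigenvectors as columns, and once from formula (8-8).

```python
import numpy as np

# Hypothetical covariance matrix (an assumption for illustration only).
sigma = np.array([[6.0, 2.0, 1.0],
                  [2.0, 3.0, 0.5],
                  [1.0, 0.5, 2.0]])

# Eigenpairs in descending order of eigenvalue; column i of P is e_i.
eigvals, P = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], P[:, order]

sd_y = np.sqrt(eigvals)            # sqrt(Var(Y_i)) = sqrt(lambda_i)
sd_x = np.sqrt(np.diag(sigma))     # sqrt(Var(X_k)) = sqrt(sigma_kk)

# Direct route: entry (i, k) of P' Sigma is Cov(Y_i, X_k).
cov_yx = P.T @ sigma
corr_direct = cov_yx / np.outer(sd_y, sd_x)

# Formula (8-8): e_ik * sqrt(lambda_i) / sqrt(sigma_kk); P.T[i, k] is e_ik.
corr_formula = P.T * sd_y[:, None] / sd_x[None, :]

print(np.round(corr_direct, 3))
print(np.round(corr_formula, 3))
```

Both routes give the same matrix. As a side observation, the squared correlations in each column sum to 1, because the p components together reproduce each $X_k$.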


Although the correlations of the variables with the principal components often
help to interpret the components, they measure only the univariate contribution of
an individual X to a component Y. That is, they do not indicate the importance of
an X to a component Y in the presence of the other X’s. For this reason, some
statisticians (see, for example, Rencher [16]) recommend that only the coefficients
$e_{ik}$, and not the correlations, be used to interpret the components. Although the
coefficients and the correlations can lead to different rankings as measures of the
importance of the variables to a given component, it is our experience that these
rankings are often not appreciably different. In practice, variables with relatively
large coefficients (in absolute value) tend to have relatively large correlations, so
the two measures of importance, the first multivariate and the second univariate,
frequently give similar results. We recommend that both the coefficients and the
correlations be examined to help interpret the principal components.

The following hypothetical example illustrates the contents of Results 8.1, 8.2,
and 8.3.


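Example 8.1 (Calculating the population principal components) Suppose the
random variables $X_1$, $X_2$ and $X_3$ have the covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix}1 & -2 & 0\\ -2 & 5 & 0\\ 0 & 0 & 2\end{bmatrix}$$

It may be verified that the eigenvalue-eigenvector pairs are

$$\begin{aligned}
\lambda_1 &= 5.83, & \mathbf{e}_1' &= [.383, -.924, 0]\\
\lambda_2 &= 2.00, & \mathbf{e}_2' &= [0, 0, 1]\\
\lambda_3 &= 0.17, & \mathbf{e}_3' &= [.924, .383, 0]
\end{aligned}$$

Therefore, the principal components become

$$\begin{aligned}
Y_1 &= \mathbf{e}_1'\mathbf{X} = .383X_1 - .924X_2\\
Y_2 &= \mathbf{e}_2'\mathbf{X} = X_3\\
Y_3 &= \mathbf{e}_3'\mathbf{X} = .924X_1 + .383X_2
\end{aligned}$$

The variable $X_3$ is one of the principal components, because it is uncorrelated with
the other two variables.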
Equation (8-5) can be demonstrated from first principles. For example,

$$\begin{aligned}
\operatorname{Var}(Y_1) &= \operatorname{Var}(.383X_1 - .924X_2)\\
&= (.383)^2\operatorname{Var}(X_1) + (-.924)^2\operatorname{Var}(X_2) + 2(.383)(-.924)\operatorname{Cov}(X_1, X_2)\\
&= .147(1) + .854(5) - .708(-2)\\
&= 5.83 = \lambda_1
\end{aligned}$$

$$\begin{aligned}
\operatorname{Cov}(Y_1, Y_2) &= \operatorname{Cov}(.383X_1 - .924X_2,\; X_3)\\
&= .383\operatorname{Cov}(X_1, X_3) - .924\operatorname{Cov}(X_2, X_3)\\
&= .383(0) - .924(0) = 0
\end{aligned}$$

It is also readily apparent that

$$\sigma_{11} + \sigma_{22} + \sigma_{33} = 1 + 5 + 2 = \lambda_1 + \lambda_2 + \lambda_3 = 5.83 + 2.00 + .17$$

validating Equation (8-6) for this example. The proportion of total variance accounted
for by the first principal component is $\lambda_1/(\lambda_1 + \lambda_2 + \lambda_3) = 5.83/8 = .73$. Further, the
first two components account for a proportion $(5.83 + 2)/8 = .98$ of the population
variance. In this case, the components $Y_1$ and $Y_2$ could replace the original three
variables with little loss of information.

Next, using (8-8), we obtain

$$\rho_{Y_1, X_1} = \frac{e_{11}\sqrt{\lambda_1}}{\sqrt{\sigma_{11}}} = \frac{.383\sqrt{5.83}}{\sqrt{1}} = .925$$

$$\rho_{Y_1, X_2} = \frac{e_{12}\sqrt{\lambda_1}}{\sqrt{\sigma_{22}}} = \frac{-.924\sqrt{5.83}}{\sqrt{5}} = -.998$$

Notice here that the variable $X_2$, with coefficient $-.924$, receives the greatest
weight in the component $Y_1$. It also has the largest correlation (in absolute value)
with $Y_1$. The correlation of $X_1$ with $Y_1$, .925, is almost as large as that for $X_2$,
indicating that the variables are about equally important to the first principal
component. The relative sizes of the coefficients of $X_1$ and $X_2$ suggest, however, that $X_2$
contributes more to the determination of $Y_1$ than does $X_1$. Since, in this case, both
coefficients are reasonably large and they have opposite signs, we would argue that
both variables aid in the interpretation of $Y_1$.

Finally,

$$\rho_{Y_2, X_1} = \rho_{Y_2, X_2} = 0 \quad\text{and}\quad \rho_{Y_2, X_3} = \frac{\sqrt{\lambda_2}}{\sqrt{\sigma_{33}}} = \frac{\sqrt{2}}{\sqrt{2}} = 1 \quad(\text{as it should})$$

The remaining correlations can be neglected, since the third component is
unimportant. ■
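The numbers in Example 8.1 can be reproduced with a few lines of NumPy. This is only a checking sketch: the covariance matrix is the one stated in the example, while the sign convention applied to the eigenvectors (first nonzero entry made positive) is an assumption added so that the output lines up with the signs printed in the text. Because the text rounds intermediate values, agreement is to about two or three decimals.

```python
import numpy as np

# Covariance matrix from Example 8.1.
sigma = np.array([[1.0, -2.0, 0.0],
                  [-2.0, 5.0, 0.0],
                  [0.0, 0.0, 2.0]])

# Eigenpairs sorted so that lambda_1 >= lambda_2 >= lambda_3.
eigvals, eigvecs = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
E = eigvecs[:, order].copy()

# Assumed sign convention: make the first nonzero entry of each e_i positive.
for i in range(E.shape[1]):
    first_nonzero = E[np.flatnonzero(np.abs(E[:, i]) > 1e-12)[0], i]
    if first_nonzero < 0:
        E[:, i] = -E[:, i]

# Proportion of total variance, (8-7), and its running sum.
proportions = lam / lam.sum()
cumulative = np.cumsum(proportions)

# Correlations of Y_1 with X_1, X_2, X_3 via (8-8).
rho_y1 = E[:, 0] * np.sqrt(lam[0]) / np.sqrt(np.diag(sigma))

print("eigenvalues:", np.round(lam, 2))
print("e_1:", np.round(E[:, 0], 3))
print("proportions:", np.round(proportions, 2))
print("corr(Y_1, X_k):", np.round(rho_y1, 3))
```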


It is informative to consider principal components derived from multivariate
normal random variables. Suppose $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. We know from
(4-7) that the density of $\mathbf{X}$ is constant on the $\boldsymbol{\mu}$ centered ellipsoids

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = c^2$$

which have axes $\pm c\sqrt{\lambda_i}\,\mathbf{e}_i$, $i = 1, 2, \ldots, p$, where the $(\lambda_i, \mathbf{e}_i)$ are the eigenvalue-
eigenvector pairs of $\boldsymbol{\Sigma}$. A point lying on the ith axis of the ellipsoid will have
coordinates proportional to $\mathbf{e}_i' = [e_{i1}, e_{i2}, \ldots, e_{ip}]$ in the coordinate system that has origin
$\boldsymbol{\mu}$ and axes that are parallel to the original axes $x_1, x_2, \ldots, x_p$. It will be convenient
to set $\boldsymbol{\mu} = \mathbf{0}$ in the argument that follows.¹

From our discussion in Section 2.3 with $\mathbf{A} = \boldsymbol{\Sigma}^{-1}$, we can write

$$c^2 = \mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} = \frac{1}{\lambda_1}(\mathbf{e}_1'\mathbf{x})^2 + \frac{1}{\lambda_2}(\mathbf{e}_2'\mathbf{x})^2 + \cdots + \frac{1}{\lambda_p}(\mathbf{e}_p'\mathbf{x})^2$$

where $\mathbf{e}_1'\mathbf{x}, \mathbf{e}_2'\mathbf{x}, \ldots, \mathbf{e}_p'\mathbf{x}$ are recognized as the principal components of $\mathbf{x}$. Setting
$y_1 = \mathbf{e}_1'\mathbf{x}$, $y_2 = \mathbf{e}_2'\mathbf{x}, \ldots, y_p = \mathbf{e}_p'\mathbf{x}$, we have

$$c^2 = \frac{1}{\lambda_1}y_1^2 + \frac{1}{\lambda_2}y_2^2 + \cdots + \frac{1}{\lambda_p}y_p^2$$

and this equation defines an ellipsoid (since $\lambda_1, \lambda_2, \ldots, \lambda_p$ are positive) in a
coordinate system with axes $y_1, y_2, \ldots, y_p$ lying in the directions of $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$,
respectively. If $\lambda_1$ is the largest eigenvalue, then the major axis lies in the direction $\mathbf{e}_1$. The
remaining minor axes lie in the directions defined by $\mathbf{e}_2, \ldots, \mathbf{e}_p$.

¹This can be done without loss of generality because the normal random vector $\mathbf{X}$ can always be
translated to the normal random vector $\mathbf{W} = \mathbf{X} - \boldsymbol{\mu}$ and $E(\mathbf{W}) = \mathbf{0}$. However, $\operatorname{Cov}(\mathbf{X}) = \operatorname{Cov}(\mathbf{W})$.

To summarize, the principal components $y_1 = \mathbf{e}_1'\mathbf{x}$, $y_2 = \mathbf{e}_2'\mathbf{x}, \ldots, y_p = \mathbf{e}_p'\mathbf{x}$ lie
in the directions of the axes of a constant density ellipsoid. Therefore, any point on
the ith ellipsoid axis has $\mathbf{x}$ coordinates proportional to $\mathbf{e}_i' = [e_{i1}, e_{i2}, \ldots, e_{ip}]$ and,
necessarily, principal component coordinates of the form $[0, \ldots, 0, y_i, 0, \ldots, 0]$.

When $\boldsymbol{\mu} \ne \mathbf{0}$, it is the mean-centered principal component $y_i = \mathbf{e}_i'(\mathbf{x} - \boldsymbol{\mu})$ that
has mean 0 and lies in the direction $\mathbf{e}_i$.

A constant density ellipse and the principal components for a bivariate normal
random vector with $\boldsymbol{\mu} = \mathbf{0}$ and $\rho = .75$ are shown in Figure 8.1. We see that the
principal components are obtained by rotating the original coordinate axes through
an angle $\theta$ until they coincide with the axes of the constant density ellipse. This result
holds for $p > 2$ dimensions as well.


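Figure 8.1 The constant density ellipse $\mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} = c^2$ and the principal components
$y_1$, $y_2$ for a bivariate normal random vector $\mathbf{X}$ having mean $\mathbf{0}$. (Figure annotations:
$\boldsymbol{\mu} = \mathbf{0}$, $\rho = .75$; rotated axes $y_1 = \mathbf{e}_1'\mathbf{x}$ and $y_2 = \mathbf{e}_2'\mathbf{x}$ drawn over the original axes $x_1$, $x_2$.)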
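The rotation described for Figure 8.1 can be sketched numerically. The figure states only $\boldsymbol{\mu} = \mathbf{0}$ and $\rho = .75$, so the unit variances used below, and the constant `c`, are assumptions made for illustration. Under equal variances the first eigenvector points along the 45 degree line.

```python
import numpy as np

# Bivariate covariance with rho = .75; unit variances are an assumption here.
rho = 0.75
sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

# eigh returns ascending eigenvalues, so the last column is the major axis.
eigvals, eigvecs = np.linalg.eigh(sigma)
lam2, lam1 = eigvals
e2, e1 = eigvecs[:, 0], eigvecs[:, 1]
if e1[0] < 0:            # assumed sign convention: e_1 points into quadrant I
    e1 = -e1

# Rotation angle between the x_1 axis and the major axis of the ellipse.
theta_deg = np.degrees(np.arctan2(e1[1], e1[0]))

# A point at the end of the major axis, x = c * sqrt(lambda_1) * e_1.
c = 1.5                  # arbitrary constant chosen for illustration
x_major = c * np.sqrt(lam1) * e1
quad_form = x_major @ np.linalg.inv(sigma) @ x_major

# Principal component coordinates of that point: (y_1, y_2).
y = np.array([e1 @ x_major, e2 @ x_major])

print("eigenvalues:", lam1, lam2)
print("rotation angle (degrees):", theta_deg)
print("x' Sigma^{-1} x:", quad_form, "vs c^2:", c**2)
print("principal component coordinates:", y)
```

The endpoint of the major axis lies on the ellipse $\mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} = c^2$ and has a zero second principal component coordinate, matching the form $[y_1, 0]$ described above.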


Principal Components Obtained from Standardized Variables 


Principal components may also be obtained for the standardized variables 


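$$Z_1 = \frac{X_1 - \mu_1}{\sqrt{\sigma_{11}}}, \qquad Z_2 = \frac{X_2 - \mu_2}{\sqrt{\sigma_{22}}}, \qquad \ldots, \qquad Z_p = \frac{X_p - \mu_p}{\sqrt{\sigma_{pp}}} \tag{8-9}$$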




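In matrix notation,

$$\mathbf{Z} = (\mathbf{V}^{1/2})^{-1}(\mathbf{X} - \boldsymbol{\mu}) \tag{8-10}$$

where the diagonal standard deviation matrix $\mathbf{V}^{1/2}$ is defined in (2-35). Clearly,
$E(\mathbf{Z}) = \mathbf{0}$ and

$$\operatorname{Cov}(\mathbf{Z}) = (\mathbf{V}^{1/2})^{-1}\boldsymbol{\Sigma}(\mathbf{V}^{1/2})^{-1} = \boldsymbol{\rho}$$

by (2-37). The principal components of $\mathbf{Z}$ may be obtained from the eigenvectors of
the correlation matrix $\boldsymbol{\rho}$ of $\mathbf{X}$. All our previous results apply, with some simplifications,
since the variance of each $Z_i$ is unity. We shall continue to use the notation $Y_i$
to refer to the $i$th principal component and $(\lambda_i, \mathbf{e}_i)$ for the eigenvalue-eigenvector
pair from either $\boldsymbol{\rho}$ or $\boldsymbol{\Sigma}$. However, the $(\lambda_i, \mathbf{e}_i)$ derived from $\boldsymbol{\Sigma}$ are, in general, not the
same as the ones derived from $\boldsymbol{\rho}$.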


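Result 8.4. The $i$th principal component of the standardized variables
$\mathbf{Z}' = [Z_1, Z_2, \ldots, Z_p]$ with $\operatorname{Cov}(\mathbf{Z}) = \boldsymbol{\rho}$, is given by

$$Y_i = \mathbf{e}_i'\mathbf{Z} = \mathbf{e}_i'(\mathbf{V}^{1/2})^{-1}(\mathbf{X} - \boldsymbol{\mu}), \qquad i = 1, 2, \ldots, p$$

Moreover,

$$\sum_{i=1}^{p} \operatorname{Var}(Y_i) = \sum_{i=1}^{p} \operatorname{Var}(Z_i) = p \tag{8-11}$$

and

$$\rho_{Y_i, Z_k} = e_{ik}\sqrt{\lambda_i}, \qquad i, k = 1, 2, \ldots, p$$

In this case, $(\lambda_1, \mathbf{e}_1), (\lambda_2, \mathbf{e}_2), \ldots, (\lambda_p, \mathbf{e}_p)$ are the eigenvalue-eigenvector pairs for
$\boldsymbol{\rho}$, with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.

Proof. Result 8.4 follows from Results 8.1, 8.2, and 8.3, with $Z_1, Z_2, \ldots, Z_p$ in place
of $X_1, X_2, \ldots, X_p$ and $\boldsymbol{\rho}$ in place of $\boldsymbol{\Sigma}$. ■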


We see from (8-11) that the total (standardized variables) population variance
is simply $p$, the sum of the diagonal elements of the matrix $\boldsymbol{\rho}$. Using (8-7) with $\mathbf{Z}$ in
place of $\mathbf{X}$, we find that the proportion of total variance explained by the $k$th principal
component of $\mathbf{Z}$ is

$$\begin{pmatrix}\text{Proportion of (standardized)}\\ \text{population variance due}\\ \text{to } k\text{th principal component}\end{pmatrix} = \frac{\lambda_k}{p}, \qquad k = 1, 2, \ldots, p \tag{8-12}$$

where the $\lambda_k$'s are the eigenvalues of $\boldsymbol{\rho}$.


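Example 8.2 (Principal components obtained from covariance and correlation matrices
are different) Consider the covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 4 \\ 4 & 100 \end{bmatrix}$$

and the derived correlation matrix

$$\boldsymbol{\rho} = \begin{bmatrix} 1 & .4 \\ .4 & 1 \end{bmatrix}$$

The eigenvalue-eigenvector pairs from $\boldsymbol{\Sigma}$ are

$$\lambda_1 = 100.16, \quad \mathbf{e}_1' = [.040, .999]$$
$$\lambda_2 = .84, \quad \mathbf{e}_2' = [.999, -.040]$$

Similarly, the eigenvalue-eigenvector pairs from $\boldsymbol{\rho}$ are

$$\lambda_1 = 1 + \rho = 1.4, \quad \mathbf{e}_1' = [.707, .707]$$
$$\lambda_2 = 1 - \rho = .6, \quad \mathbf{e}_2' = [.707, -.707]$$

The respective principal components become

$$\boldsymbol{\Sigma}: \quad Y_1 = .040X_1 + .999X_2, \qquad Y_2 = .999X_1 - .040X_2$$

and

$$\boldsymbol{\rho}: \quad Y_1 = .707Z_1 + .707Z_2 = .707\left(\frac{X_1 - \mu_1}{1}\right) + .707\left(\frac{X_2 - \mu_2}{10}\right) = .707(X_1 - \mu_1) + .0707(X_2 - \mu_2)$$

$$\phantom{\boldsymbol{\rho}:} \quad Y_2 = .707Z_1 - .707Z_2 = .707\left(\frac{X_1 - \mu_1}{1}\right) - .707\left(\frac{X_2 - \mu_2}{10}\right) = .707(X_1 - \mu_1) - .0707(X_2 - \mu_2)$$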


Because of its large variance, $X_2$ completely dominates the first principal component
determined from $\boldsymbol{\Sigma}$. Moreover, this first principal component explains a proportion

$$\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{100.16}{101} = .992$$

of the total population variance.

When the variables $X_1$ and $X_2$ are standardized, however, the resulting
variables contribute equally to the principal components determined from $\boldsymbol{\rho}$. Using
Result 8.4, we obtain

$$\rho_{Y_1, Z_1} = e_{11}\sqrt{\lambda_1} = .707\sqrt{1.4} = .837$$

and

$$\rho_{Y_1, Z_2} = e_{12}\sqrt{\lambda_1} = .707\sqrt{1.4} = .837$$

In this case, the first principal component explains a proportion

$$\frac{\lambda_1}{p} = \frac{1.4}{2} = .7$$

of the total (standardized) population variance.
Most strikingly, we see that the relative importance of the variables to, for
instance, the first principal component is greatly affected by the standardization.
When the first principal component obtained from $\boldsymbol{\rho}$ is expressed in terms of $X_1$
and $X_2$, the relative magnitudes of the weights .707 and .0707 are in direct opposition
to those of the weights .040 and .999 attached to these variables in the principal
component obtained from $\boldsymbol{\Sigma}$. ■
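The numbers in Example 8.2 can be reproduced with a few lines of Python. This is a sketch rather than a definitive implementation: the helper `sym2x2_eig` is a hypothetical name for the closed-form eigen-decomposition of a symmetric $2 \times 2$ matrix, and the eigenvector signs it returns are one of two equally valid choices.

```python
import math

def sym2x2_eig(a, b, d):
    """Eigenpairs of [[a, b], [b, d]]: returns (lam1, e1, lam2, e2), lam1 >= lam2."""
    center = (a + d) / 2.0
    half_gap = math.hypot((a - d) / 2.0, b)
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    e1 = (math.cos(theta), math.sin(theta))
    e2 = (-math.sin(theta), math.cos(theta))
    return center + half_gap, e1, center - half_gap, e2

# Covariance matrix of Example 8.2.
sigma = [[1.0, 4.0], [4.0, 100.0]]

# Correlation matrix: rho_ik = sigma_ik / (sqrt(sigma_ii) * sqrt(sigma_kk)).
sd = [math.sqrt(sigma[i][i]) for i in range(2)]
rho = [[sigma[i][k] / (sd[i] * sd[k]) for k in range(2)] for i in range(2)]

lam1_s, e1_s, lam2_s, e2_s = sym2x2_eig(sigma[0][0], sigma[0][1], sigma[1][1])
lam1_r, e1_r, lam2_r, e2_r = sym2x2_eig(rho[0][0], rho[0][1], rho[1][1])

# Proportion of total variance explained by the first component in each case.
prop_sigma = lam1_s / (lam1_s + lam2_s)
prop_rho = lam1_r / 2.0

# Correlation between Y1 and Z1 from Result 8.4: e_11 * sqrt(lambda_1).
corr_y1_z1 = e1_r[0] * math.sqrt(lam1_r)

# First rho-based component expressed as weights on (X_k - mu_k).
weights_on_x = [e1_r[k] / sd[k] for k in range(2)]

print(f"from Sigma: lambda1 = {lam1_s:.2f}, e1 = ({e1_s[0]:.3f}, {e1_s[1]:.3f})")
print(f"from rho:   lambda1 = {lam1_r:.2f}, e1 = ({e1_r[0]:.3f}, {e1_r[1]:.3f})")
print(f"proportions: {prop_sigma:.3f} (Sigma), {prop_rho:.2f} (rho)")
print(f"weights on X: {weights_on_x[0]:.3f}, {weights_on_x[1]:.4f}")
```

Dividing each coefficient of the $\boldsymbol{\rho}$-based component by the corresponding standard deviation is what turns the equal weights .707 and .707 on the $Z$'s into the unequal weights .707 and .0707 on the $X$'s.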


The preceding example demonstrates that the principal components derived
from $\boldsymbol{\Sigma}$ are different from those derived from $\boldsymbol{\rho}$. Furthermore, one set of principal
components is not a simple function of the other. This suggests that the standardization
is not inconsequential.

Variables should probably be standardized if they are measured on scales with
widely differing ranges or if the units of measurement are not commensurate. For
example, if $X_1$ represents annual sales in the $10,000 to $350,000 range and $X_2$ is the
ratio (net annual income)/(total assets) that falls in the .01 to .60 range, then the
total variation will be due almost exclusively to dollar sales. In this case, we would
expect a single (important) principal component with a heavy weighting of $X_1$.
Alternatively, if both variables are standardized, their subsequent magnitudes will
be of the same order, and $X_2$ (or $Z_2$) will play a larger role in the construction of the
principal components. This behavior was observed in Example 8.2.


Principal Components for Covariance Matrices 
with Special Structures 


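There are certain patterned covariance and correlation matrices whose principal
components can be expressed in simple forms. Suppose $\boldsymbol{\Sigma}$ is the diagonal matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{bmatrix} \tag{8-13}$$

Setting $\mathbf{e}_i' = [0, \ldots, 0, 1, 0, \ldots, 0]$, with 1 in the $i$th position, we observe that

$$\begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{bmatrix} \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1\sigma_{ii} \\ 0 \\ \vdots \\ 0 \end{bmatrix} \qquad \text{or} \qquad \boldsymbol{\Sigma}\mathbf{e}_i = \sigma_{ii}\mathbf{e}_i$$

and we conclude that $(\sigma_{ii}, \mathbf{e}_i)$ is the $i$th eigenvalue-eigenvector pair. Since the linear
combination $\mathbf{e}_i'\mathbf{X} = X_i$, the set of principal components is just the original set of
uncorrelated random variables.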

For a covariance matrix with the pattern of (8-13), nothing is gained by extracting
the principal components. From another point of view, if $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$,
the contours of constant density are ellipsoids whose axes already lie in the directions
of maximum variation. Consequently, there is no need to rotate the coordinate system.

Standardization does not substantially alter the situation for the $\boldsymbol{\Sigma}$ in (8-13). In
that case, $\boldsymbol{\rho} = \mathbf{I}$, the $p \times p$ identity matrix. Clearly, $\boldsymbol{\rho}\mathbf{e}_i = 1\mathbf{e}_i$, so the eigenvalue 1
has multiplicity $p$ and $\mathbf{e}_i' = [0, \ldots, 0, 1, 0, \ldots, 0]$, $i = 1, 2, \ldots, p$, are convenient
choices for the eigenvectors. Consequently, the principal components determined
from $\boldsymbol{\rho}$ are also the original variables $Z_1, \ldots, Z_p$. Moreover, in this case of equal
eigenvalues, the multivariate normal ellipsoids of constant density are spheroids.

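Another patterned covariance matrix, which often describes the correspondence
among certain biological variables such as the sizes of living things, has the
general form

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma^2 & \rho\sigma^2 & \cdots & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \cdots & \rho\sigma^2 \\ \vdots & \vdots & \ddots & \vdots \\ \rho\sigma^2 & \rho\sigma^2 & \cdots & \sigma^2 \end{bmatrix} \tag{8-14}$$

The resulting correlation matrix

$$\boldsymbol{\rho} = \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{bmatrix} \tag{8-15}$$

is also the covariance matrix of the standardized variables. The matrix in (8-15)
implies that the variables $X_1, X_2, \ldots, X_p$ are equally correlated.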

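It is not difficult to show (see Exercise 8.5) that the $p$ eigenvalues of the correlation
matrix (8-15) can be divided into two groups. When $\rho$ is positive, the largest is

$$\lambda_1 = 1 + (p - 1)\rho \tag{8-16}$$

with associated eigenvector

$$\mathbf{e}_1' = \left[\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}, \ldots, \frac{1}{\sqrt{p}}\right] \tag{8-17}$$

The remaining $p - 1$ eigenvalues are

$$\lambda_2 = \lambda_3 = \cdots = \lambda_p = 1 - \rho$$

and one choice for their eigenvectors is

$$\mathbf{e}_2' = \left[\frac{1}{\sqrt{1 \times 2}}, \frac{-1}{\sqrt{1 \times 2}}, 0, \ldots, 0\right]$$

$$\mathbf{e}_3' = \left[\frac{1}{\sqrt{2 \times 3}}, \frac{1}{\sqrt{2 \times 3}}, \frac{-2}{\sqrt{2 \times 3}}, 0, \ldots, 0\right]$$

$$\vdots$$

$$\mathbf{e}_i' = \left[\frac{1}{\sqrt{(i-1)i}}, \ldots, \frac{1}{\sqrt{(i-1)i}}, \frac{-(i-1)}{\sqrt{(i-1)i}}, 0, \ldots, 0\right]$$

$$\vdots$$

$$\mathbf{e}_p' = \left[\frac{1}{\sqrt{(p-1)p}}, \ldots, \frac{1}{\sqrt{(p-1)p}}, \frac{-(p-1)}{\sqrt{(p-1)p}}\right]$$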

The first principal component

$$Y_1 = \mathbf{e}_1'\mathbf{Z} = \frac{1}{\sqrt{p}}\sum_{i=1}^{p} Z_i$$

is proportional to the sum of the $p$ standardized variables. It might be regarded as an
"index" with equal weights. This principal component explains a proportion

$$\frac{\lambda_1}{p} = \frac{1 + (p - 1)\rho}{p} = \rho + \frac{1 - \rho}{p} \tag{8-18}$$

of the total population variation. We see that $\lambda_1/p \doteq \rho$ for $\rho$ close to 1 or $p$ large.
For example, if $\rho = .80$ and $p = 5$, the first component explains 84% of the
total variance. When $\rho$ is near 1, the last $p - 1$ components collectively contribute
very little to the total variance and can often be neglected. In this special
case, retaining only the first principal component $Y_1 = (1/\sqrt{p})[1, 1, \ldots, 1]\mathbf{X}$,
a measure of total size, still explains the same proportion (8-18) of total
variance.

If the standardized variables $Z_1, Z_2, \ldots, Z_p$ have a multivariate normal distribution
with a covariance matrix given by (8-15), then the ellipsoids of constant density
are "cigar shaped," with the major axis proportional to the first principal
component $Y_1 = (1/\sqrt{p})[1, 1, \ldots, 1]\mathbf{Z}$. This principal component is the projection
of $\mathbf{Z}$ on the equiangular line $\mathbf{1}' = [1, 1, \ldots, 1]$. The minor axes (and remaining principal
components) occur in spherically symmetric directions perpendicular to the
major axis (and first principal component).
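The eigenstructure claimed for the equal-correlation matrix (8-15) is easy to verify directly. The sketch below builds the matrix for the text's illustration ($p = 5$, $\rho = .80$) and checks that the equal-weight vector and the contrast vectors satisfy $\boldsymbol{\rho}\mathbf{e} = \lambda\mathbf{e}$. The function names are hypothetical and chosen only for this illustration.

```python
import math

def equicorrelation(p, rho):
    """p x p matrix with 1 on the diagonal and rho elsewhere, as in (8-15)."""
    return [[1.0 if i == k else rho for k in range(p)] for i in range(p)]

def contrast_vector(i, p):
    """Eigenvector e_i (i = 2, ..., p): i - 1 equal entries, then -(i - 1), then zeros."""
    scale = math.sqrt((i - 1) * i)
    return [1.0 / scale] * (i - 1) + [-(i - 1) / scale] + [0.0] * (p - i)

def max_residual(matrix, vector, eigenvalue):
    """Largest absolute entry of (matrix @ vector - eigenvalue * vector)."""
    product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
    return max(abs(pv - eigenvalue * v) for pv, v in zip(product, vector))

p, rho = 5, 0.80
R = equicorrelation(p, rho)

# Largest eigenvalue (8-16) with the equal-weight eigenvector (8-17).
lam1 = 1.0 + (p - 1) * rho
e1 = [1.0 / math.sqrt(p)] * p
worst = max_residual(R, e1, lam1)

# Remaining eigenvalues all equal 1 - rho.
for i in range(2, p + 1):
    worst = max(worst, max_residual(R, contrast_vector(i, p), 1.0 - rho))

proportion_first = lam1 / p  # (8-18)
print(f"largest residual = {worst:.2e}")
print(f"proportion explained by first component = {proportion_first:.2f}")
```

Because every contrast vector sums to zero, multiplying it by the matrix leaves only the factor $1 - \rho$, which is why all $p - 1$ of them share one eigenvalue.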


8.3 Summarizing Sample Variation by Principal Components 


We now have the framework necessary to study the problem of summarizing the
variation in $n$ measurements on $p$ variables with a few judiciously chosen linear
combinations.

Suppose the data $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ represent $n$ independent drawings from some
$p$-dimensional population with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. These data
yield the sample mean vector $\bar{\mathbf{x}}$, the sample covariance matrix $\mathbf{S}$, and the sample
correlation matrix $\mathbf{R}$.

Our objective in this section will be to construct uncorrelated linear combina- 
tions of the measured characteristics that account for much of the variation in the 
sample. The uncorrelated combinations with the largest variances will be called the 
sample principal components. 

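Recall that the $n$ values of any linear combination

$$\mathbf{a}_1'\mathbf{x}_j = a_{11}x_{j1} + a_{12}x_{j2} + \cdots + a_{1p}x_{jp}, \qquad j = 1, 2, \ldots, n$$

have sample mean $\mathbf{a}_1'\bar{\mathbf{x}}$ and sample variance $\mathbf{a}_1'\mathbf{S}\mathbf{a}_1$. Also, the pairs of values
$(\mathbf{a}_1'\mathbf{x}_j, \mathbf{a}_2'\mathbf{x}_j)$, for two linear combinations, have sample covariance $\mathbf{a}_1'\mathbf{S}\mathbf{a}_2$ [see
(3-36)].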
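These three identities can be confirmed on a small data set. The Python sketch below uses hypothetical data ($n = 5$, $p = 2$) and two arbitrary coefficient vectors; none of these numbers come from the text. It compares the statistics computed directly from the $n$ values of each linear combination with the matrix expressions $\mathbf{a}_1'\bar{\mathbf{x}}$, $\mathbf{a}_1'\mathbf{S}\mathbf{a}_1$, and $\mathbf{a}_1'\mathbf{S}\mathbf{a}_2$.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sample_mean(values):
    return sum(values) / len(values)

def sample_cov(u, v):
    """Sample covariance of two equal-length lists (divisor n - 1)."""
    ubar, vbar = sample_mean(u), sample_mean(v)
    return sum((ui - ubar) * (vi - vbar) for ui, vi in zip(u, v)) / (len(u) - 1)

def quad_form(a, M, b):
    """Compute a' M b."""
    return dot(a, [dot(row, b) for row in M])

# Hypothetical data: n = 5 observations on p = 2 variables.
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0], [10.0, 8.0]]
columns = list(zip(*data))
p = len(columns)

xbar = [sample_mean(col) for col in columns]
S = [[sample_cov(columns[i], columns[k]) for k in range(p)] for i in range(p)]

# Two illustrative unit-length coefficient vectors.
a1 = [0.6, 0.8]
a2 = [0.8, -0.6]

# The n values of each linear combination.
v1 = [dot(a1, x) for x in data]
v2 = [dot(a2, x) for x in data]

mean_direct, mean_formula = sample_mean(v1), dot(a1, xbar)
var_direct, var_formula = sample_cov(v1, v1), quad_form(a1, S, a1)
cov_direct, cov_formula = sample_cov(v1, v2), quad_form(a1, S, a2)

print(mean_direct, mean_formula)
print(var_direct, var_formula)
print(cov_direct, cov_formula)
```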


The sample principal components are defined as those linear combinations
which have maximum sample variance. As with the population quantities, we restrict
the coefficient vectors $\mathbf{a}_i$ to satisfy $\mathbf{a}_i'\mathbf{a}_i = 1$. Specifically,

First sample principal component = linear combination $\mathbf{a}_1'\mathbf{x}_j$ that maximizes
the sample variance of $\mathbf{a}_1'\mathbf{x}_j$ subject to $\mathbf{a}_1'\mathbf{a}_1 = 1$

Second sample principal component = linear combination $\mathbf{a}_2'\mathbf{x}_j$ that maximizes the sample
variance of $\mathbf{a}_2'\mathbf{x}_j$ subject to $\mathbf{a}_2'\mathbf{a}_2 = 1$ and zero sample
covariance for the pairs $(\mathbf{a}_1'\mathbf{x}_j, \mathbf{a}_2'\mathbf{x}_j)$

At the $i$th step, we have

$i$th sample principal component = linear combination $\mathbf{a}_i'\mathbf{x}_j$ that maximizes the sample
variance of $\mathbf{a}_i'\mathbf{x}_j$ subject to $\mathbf{a}_i'\mathbf{a}_i = 1$ and zero sample
covariance for all pairs $(\mathbf{a}_i'\mathbf{x}_j, \mathbf{a}_k'\mathbf{x}_j)$, $k < i$


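The first principal component maximizes $\mathbf{a}_1'\mathbf{S}\mathbf{a}_1$ or, equivalently,

$$\frac{\mathbf{a}_1'\mathbf{S}\mathbf{a}_1}{\mathbf{a}_1'\mathbf{a}_1} \tag{8-19}$$

By (2-51), the maximum is the largest eigenvalue $\hat{\lambda}_1$ attained for the choice
$\mathbf{a}_1$ = eigenvector $\hat{\mathbf{e}}_1$ of $\mathbf{S}$. Successive choices of $\mathbf{a}_i$ maximize (8-19) subject to
$0 = \mathbf{a}_i'\mathbf{S}\hat{\mathbf{e}}_k = \mathbf{a}_i'\hat{\lambda}_k\hat{\mathbf{e}}_k$, or $\mathbf{a}_i$ perpendicular to $\hat{\mathbf{e}}_k$. Thus, as in the proofs of Results
8.1-8.3, we obtain the following results concerning sample principal components: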


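If $\mathbf{S} = \{s_{ik}\}$ is the $p \times p$ sample covariance matrix with eigenvalue-eigenvector
pairs $(\hat{\lambda}_1, \hat{\mathbf{e}}_1), (\hat{\lambda}_2, \hat{\mathbf{e}}_2), \ldots, (\hat{\lambda}_p, \hat{\mathbf{e}}_p)$, the $i$th sample principal component is given
by

$$\hat{y}_i = \hat{\mathbf{e}}_i'\mathbf{x} = \hat{e}_{i1}x_1 + \hat{e}_{i2}x_2 + \cdots + \hat{e}_{ip}x_p, \qquad i = 1, 2, \ldots, p$$

where $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_p \ge 0$ and $\mathbf{x}$ is any observation on the variables
$X_1, X_2, \ldots, X_p$. Also,

$$\text{Sample variance}(\hat{y}_k) = \hat{\lambda}_k, \qquad k = 1, 2, \ldots, p$$
$$\text{Sample covariance}(\hat{y}_i, \hat{y}_k) = 0, \qquad i \neq k \tag{8-20}$$

In addition,

$$\text{Total sample variance} = \sum_{i=1}^{p} s_{ii} = \hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_p$$

and

$$r_{\hat{y}_i, x_k} = \frac{\hat{e}_{ik}\sqrt{\hat{\lambda}_i}}{\sqrt{s_{kk}}}, \qquad i, k = 1, 2, \ldots, p$$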

We shall denote the sample principal components by $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_p$, irrespective
of whether they are obtained from $\mathbf{S}$ or $\mathbf{R}$.² The components constructed from $\mathbf{S}$ and
$\mathbf{R}$ are not the same, in general, but it will be clear from the context which matrix is
being used, and the single notation $\hat{y}_i$ is convenient. It is also convenient to label the
component coefficient vectors $\hat{\mathbf{e}}_i$ and the component variances $\hat{\lambda}_i$ for both situations.

The observations $\mathbf{x}_j$ are often "centered" by subtracting $\bar{\mathbf{x}}$. This has no effect on
the sample covariance matrix $\mathbf{S}$ and gives the $i$th principal component

$$\hat{y}_i = \hat{\mathbf{e}}_i'(\mathbf{x} - \bar{\mathbf{x}}), \qquad i = 1, 2, \ldots, p \tag{8-21}$$

for any observation vector $\mathbf{x}$. If we consider the values of the $i$th component

$$\hat{y}_{ji} = \hat{\mathbf{e}}_i'(\mathbf{x}_j - \bar{\mathbf{x}}), \qquad j = 1, 2, \ldots, n \tag{8-22}$$

generated by substituting each observation $\mathbf{x}_j$ for the arbitrary $\mathbf{x}$ in (8-21), then

$$\bar{\hat{y}}_i = \frac{1}{n}\sum_{j=1}^{n}\hat{\mathbf{e}}_i'(\mathbf{x}_j - \bar{\mathbf{x}}) = \frac{1}{n}\hat{\mathbf{e}}_i'\left(\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})\right) = \frac{1}{n}\hat{\mathbf{e}}_i'\mathbf{0} = 0 \tag{8-23}$$

That is, the sample mean of each principal component is zero. The sample variances
are still given by the $\hat{\lambda}_i$'s, as in (8-20).
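The properties of the centered components can be illustrated on a small bivariate data set. In the Python sketch below the data are hypothetical, and `sym2x2_eig` is an assumed helper name for the closed-form eigen-decomposition of a symmetric $2 \times 2$ matrix. The component values are formed as in (8-22), and their sample means, variances, and covariance are then computed for comparison with (8-23) and (8-20).

```python
import math

def sym2x2_eig(a, b, d):
    """Eigenpairs of [[a, b], [b, d]]: returns (lam1, e1, lam2, e2), lam1 >= lam2."""
    center = (a + d) / 2.0
    half_gap = math.hypot((a - d) / 2.0, b)
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    e1 = (math.cos(theta), math.sin(theta))
    e2 = (-math.sin(theta), math.cos(theta))
    return center + half_gap, e1, center - half_gap, e2

# Hypothetical data: n = 6 observations on p = 2 variables.
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0], [10.0, 8.0], [3.0, 5.0]]
n = len(data)

# Center the observations by subtracting the sample mean vector.
xbar = [sum(col) / n for col in zip(*data)]
centered = [[x - m for x, m in zip(row, xbar)] for row in data]

def s(i, k):
    """Entry s_ik of the sample covariance matrix S (divisor n - 1)."""
    return sum(row[i] * row[k] for row in centered) / (n - 1)

lam1, e1, lam2, e2 = sym2x2_eig(s(0, 0), s(0, 1), s(1, 1))

# Values of the two sample principal components for every observation, as in (8-22).
y1 = [e1[0] * r[0] + e1[1] * r[1] for r in centered]
y2 = [e2[0] * r[0] + e2[1] * r[1] for r in centered]

mean_y1 = sum(y1) / n
mean_y2 = sum(y2) / n
# The means are zero, so no further centering is needed in these sums.
var_y1 = sum(v * v for v in y1) / (n - 1)
var_y2 = sum(v * v for v in y2) / (n - 1)
cov_y1_y2 = sum(a * b for a, b in zip(y1, y2)) / (n - 1)

print(f"means: {mean_y1:.2e}, {mean_y2:.2e}")
print(f"variances: {var_y1:.4f}, {var_y2:.4f}; eigenvalues: {lam1:.4f}, {lam2:.4f}")
print(f"covariance: {cov_y1_y2:.2e}")
```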


Example 8.3 (Summarizing sample variability with two sample principal components)
A census provided information, by tract, on five socioeconomic variables for the
Madison, Wisconsin, area. The data from 61 tracts are listed in Table 8.5 in the exercises
at the end of this chapter. These data produced the following summary statistics:

$$\bar{\mathbf{x}}' = [4.47, \; 3.96, \; 71.42, \; 26.91, \; 1.64]$$

The entries of $\bar{\mathbf{x}}'$ are, in order: total population (thousands), professional degree
(percent), employed age over 16 (percent), government employment (percent), and
median home value ($100,000).

and

$$\mathbf{S} = \begin{bmatrix} 3.397 & -1.102 & 4.306 & -2.078 & 0.027 \\ -1.102 & 9.673 & -1.513 & 10.953 & 1.203 \\ 4.306 & -1.513 & 55.626 & -28.937 & -0.044 \\ -2.078 & 10.953 & -28.937 & 89.067 & 0.957 \\ 0.027 & 1.203 & -0.044 & 0.957 & 0.319 \end{bmatrix}$$

Can the sample variation be summarized by one or two principal components?


²Sample principal components also can be obtained from $\hat{\boldsymbol{\Sigma}} = \mathbf{S}_n$, the maximum likelihood
estimate of the covariance matrix $\boldsymbol{\Sigma}$, if the $\mathbf{X}_j$ are normally distributed. (See Result 4.11.) In this case,
provided that the eigenvalues of $\boldsymbol{\Sigma}$ are distinct, the sample principal components can be viewed as
the maximum likelihood estimates of the corresponding population counterparts. (See [1].) We shall
not consider $\hat{\boldsymbol{\Sigma}}$ because the assumption of normality is not required in this section. Also, $\hat{\boldsymbol{\Sigma}}$ has eigenvalues
$[(n - 1)/n]\hat{\lambda}_i$ and corresponding eigenvectors $\hat{\mathbf{e}}_i$, where $(\hat{\lambda}_i, \hat{\mathbf{e}}_i)$ are the eigenvalue-eigenvector pairs for
$\mathbf{S}$. Thus, both $\mathbf{S}$ and $\hat{\boldsymbol{\Sigma}}$ give the same sample principal components $\hat{\mathbf{e}}_i'\mathbf{x}$ [see (8-20)] and the same
proportion of explained variance $\hat{\lambda}_i/(\hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_p)$. Finally, both $\mathbf{S}$ and $\hat{\boldsymbol{\Sigma}}$ give the same sample
correlation matrix $\mathbf{R}$, so if the variables are standardized, the choice of $\mathbf{S}$ or $\hat{\boldsymbol{\Sigma}}$ is irrelevant.


We find the following:

Coefficients for the Principal Components
(Correlation Coefficients in Parentheses)

Variable                        $\hat{\mathbf{e}}_1\ (r_{\hat{y}_1,x_k})$    $\hat{\mathbf{e}}_2\ (r_{\hat{y}_2,x_k})$    $\hat{\mathbf{e}}_3$    $\hat{\mathbf{e}}_4$
Total population                -0.039 (-.22)      0.071 (.24)       0.188     0.977
Profession                       0.105 (.35)       0.130 (.26)      -0.961     0.171
Employment (%)                  -0.492 (-.68)      0.864 (.73)       0.046    -0.091
Government employment (%)        0.863 (.95)       0.480 (.32)       0.153    -0.030
Median home value                0.009 (.16)       0.015 (.17)      -0.125     0.082

Variance ($\hat{\lambda}_i$):                   107.02             39.67             8.37      2.87
Cumulative percentage
of total variance                 67.7              92.8              98.1      99.9


The first principal component explains 67.7% of the total sample variance. The
first two principal components, collectively, explain 92.8% of the total sample variance.
Consequently, sample variation is summarized very well by two principal components
and a reduction in the data from 61 observations on 5 variables to 61
observations on 2 principal components is reasonable.

Given the foregoing component coefficients, the first principal component
appears to be essentially a weighted difference between the percent employed by
government and the percent total employment. The second principal component
appears to be a weighted sum of the two. ■
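The leading eigenvalues in Example 8.3 can be recovered from the printed matrix $\mathbf{S}$ without a linear algebra library. The sketch below uses power iteration with deflation, which is an illustrative numerical technique and not necessarily how the results in the text were computed. The function names, the iteration count, and the sign convention are assumptions made for this example, and small discrepancies are to be expected because $\mathbf{S}$ is printed to three decimals.

```python
import math

# Sample covariance matrix S from Example 8.3 (entries as printed).
S = [
    [3.397, -1.102, 4.306, -2.078, 0.027],
    [-1.102, 9.673, -1.513, 10.953, 1.203],
    [4.306, -1.513, 55.626, -28.937, -0.044],
    [-2.078, 10.953, -28.937, 89.067, 0.957],
    [0.027, 1.203, -0.044, 0.957, 0.319],
]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def power_iteration(m, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric
    positive semidefinite matrix, by repeated multiplication."""
    p = len(m)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(v, matvec(m, v)))
    # Sign convention (an arbitrary choice): largest coefficient positive.
    if max(v, key=abs) < 0:
        v = [-x for x in v]
    return lam, v

def deflate(m, lam, v):
    """Subtract lam * v v' so that the next eigenpair becomes dominant."""
    p = len(m)
    return [[m[i][k] - lam * v[i] * v[k] for k in range(p)] for i in range(p)]

lam1, e1 = power_iteration(S)
lam2, e2 = power_iteration(deflate(S, lam1, e1))

total = sum(S[i][i] for i in range(len(S)))  # total sample variance = trace of S
cum1 = lam1 / total
cum2 = (lam1 + lam2) / total

# Correlation between the first component and government employment
# (variable 4), using r = e_ik * sqrt(lambda_i) / sqrt(s_kk) from (8-20).
r_y1_x4 = e1[3] * math.sqrt(lam1) / math.sqrt(S[3][3])

print(f"lambda1 = {lam1:.2f}, lambda2 = {lam2:.2f}")
print(f"cumulative proportions: {cum1:.3f}, {cum2:.3f}")
print(f"r(y1, x4) = {r_y1_x4:.2f}")
```

Power iteration converges quickly here because the ratio of the second eigenvalue to the first is well below 1.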


As we said in our discussion of the population components, the component
coefficients $\hat{e}_{ik}$ and the correlations $r_{\hat{y}_i, x_k}$ should both be examined to interpret the
principal components. The correlations allow for differences in the variances of
the original variables, but only measure the importance of an individual $X$ without
regard to the other $X$'s making up the component. We notice in Example 8.3,
however, that the correlation coefficients displayed in the table confirm the
interpretation provided by the component coefficients.


The Number of Principal Components 


There is always the question of how many components to retain. There is no definitive
answer to this question. Things to consider include the amount of total sample
variance explained, the relative sizes of the eigenvalues (the variances of the sample
components), and the subject-matter interpretations of the components. In addition,
as we discuss later, a component associated with an eigenvalue near zero
and, hence, deemed unimportant, may indicate an unsuspected linear dependency
in the data.


Figure 8.2 A scree plot. (Vertical axis: $\hat{\lambda}_i$; horizontal axis: $i = 1, 2, \ldots, 6$.)


A useful visual aid to determining an appropriate number of principal
components is a scree plot.³ With the eigenvalues ordered from largest to smallest,
a scree plot is a plot of $\hat{\lambda}_i$ versus $i$, that is, the magnitude of an eigenvalue versus its
number. To determine the appropriate number of components, we look for an
elbow (bend) in the scree plot. The number of components is taken to be the
point at which the remaining eigenvalues are relatively small and all about
the same size. Figure 8.2 shows a scree plot for a situation with six principal
components.

An elbow occurs in the plot in Figure 8.2 at about $i = 3$. That is, the eigenvalues
after $\hat{\lambda}_2$ are all relatively small and about the same size. In this case, it appears,
without any other evidence, that two (or perhaps three) sample principal components
effectively summarize the total sample variance.


Example 8.4 (Summarizing sample variability with one sample principal component) 
In a study of size and shape relationships for painted turtles, Jolicoeur and Mosi- 
mann [11] measured carapace length, width, and height. Their data, reproduced in 
Exercise 6.18, Table 6.9, suggest an analysis in terms of logarithms. (Jolicoeur [10] 
generally suggests a logarithmic transformation in studies of size-and-shape rela- 
tionships.) Perform a principal component analysis. 


³Scree is the rock debris at the bottom of a cliff.

The natural logarithms of the dimensions of 24 male turtles have sample mean
vector $\bar{\mathbf{x}}' = [4.725, 4.478, 3.703]$ and covariance matrix

$$\mathbf{S} = 10^{-3}\begin{bmatrix} 11.072 & 8.019 & 8.160 \\ 8.019 & 6.417 & 6.005 \\ 8.160 & 6.005 & 6.773 \end{bmatrix}$$


A principal component analysis (see Panel 8.1 on page 447 for the output from
the SAS statistical software package) yields the following summary:

Coefficients for the Principal Components
(Correlation Coefficients in Parentheses)

Variable          $\hat{\mathbf{e}}_1\ (r_{\hat{y}_1,x_k})$    $\hat{\mathbf{e}}_2$    $\hat{\mathbf{e}}_3$
ln (length)       .683 (.99)       -.159     -.713
ln (width)        .510 (.97)       -.594      .622
ln (height)       .523 (.97)        .788      .324

Variance ($\hat{\lambda}_i$):    $23.30 \times 10^{-3}$    $.60 \times 10^{-3}$    $.36 \times 10^{-3}$
Cumulative percentage
of total variance  96.1             98.5      100


A scree plot is shown in Figure 8.3. The very distinct elbow in this plot occurs
at $i = 2$. There is clearly one dominant principal component.

The first principal component, which explains 96% of the total variance, has an
interesting subject-matter interpretation. Since

$$\hat{y}_1 = .683 \ln(\text{length}) + .510 \ln(\text{width}) + .523 \ln(\text{height}) = \ln\left[(\text{length})^{.683}(\text{width})^{.510}(\text{height})^{.523}\right]$$


Figure 8.3 A scree plot for the turtle data. (Vertical axis: $\hat{\lambda}_i \times 10^3$; horizontal axis: $i = 1, 2, 3$.)


PANEL 8.1 SAS ANALYSIS FOR EXAMPLE 8.4 USING PROC PRINCOMP.

PROGRAM COMMANDS

    title 'Principal Component Analysis';
    data turtle;
    infile 'E8-4.dat';
    input length width height;
    x1 = log(length); x2 = log(width); x3 = log(height);
    proc princomp cov data = turtle out = result;
    var x1 x2 x3;

Principal Components Analysis

    24 Observations
    3 Variables

Simple Statistics

                    X1             X2             X3
    Mean     4.725443647    4.477573765    3.703185794
    StD      0.105223590    0.080104466    0.082296771

Covariance Matrix

                    X1              X2              X3
    X1     0.0110720040    0.0080191419    0.0081596480
    X2     0.0080191419    0.0064167255    0.0060052707
    X3     0.0081596480    0.0060052707    0.0067727585

    Total Variance = 0.024261488

Eigenvalues of the Covariance Matrix

              Eigenvalue    Difference    Proportion    Cumulative
    PRIN1       0.023303      0.022705      0.960508       0.96051
    PRIN2       0.000598      0.000238      0.024661       0.98517
    PRIN3       0.000360                    0.014832       1.00000

Eigenvectors

                 PRIN1        PRIN2        PRIN3
    X1        0.683102      -.159479      -.712697
    X2        0.510220      -.594012      0.621953
    X3        0.522539      0.788490      0.324401


the first principal component may be viewed as the ln (volume) of a box with adjusted
dimensions. For instance, the adjusted height is $(\text{height})^{.523}$, which accounts,
in some sense, for the rounded shape of the carapace. ■
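Two points in Example 8.4 lend themselves to a quick numerical check. The Python sketch below first confirms that the first eigenvalue and eigenvector printed in Panel 8.1 satisfy $\mathbf{S}\hat{\mathbf{e}}_1 \approx \hat{\lambda}_1\hat{\mathbf{e}}_1$ to within rounding. It then evaluates $\hat{y}_1$ for one set of carapace dimensions and compares it with the logarithm of the "adjusted volume." The dimensions used are hypothetical values chosen for illustration and are not taken from Table 6.9.

```python
import math

# Covariance matrix and first eigenpair as printed in Panel 8.1.
S = [
    [0.0110720040, 0.0080191419, 0.0081596480],
    [0.0080191419, 0.0064167255, 0.0060052707],
    [0.0081596480, 0.0060052707, 0.0067727585],
]
lam1 = 0.023303
e1 = [0.683102, 0.510220, 0.522539]

# Check the eigen-equation S e1 = lam1 e1 (up to rounding in the printout).
S_e1 = [sum(S[i][k] * e1[k] for k in range(3)) for i in range(3)]
max_residual = max(abs(S_e1[i] - lam1 * e1[i]) for i in range(3))

# Proportion of total variance explained by the first component.
total_variance = sum(S[i][i] for i in range(3))
proportion = lam1 / total_variance

# Hypothetical carapace dimensions (illustrative values only).
length, width, height = 120.0, 90.0, 45.0

# First principal component (uncentered form) and the "adjusted volume".
y1 = e1[0] * math.log(length) + e1[1] * math.log(width) + e1[2] * math.log(height)
adjusted_volume = length ** e1[0] * width ** e1[1] * height ** e1[2]

print(f"max residual of S e1 - lam1 e1: {max_residual:.1e}")
print(f"proportion explained: {proportion:.4f}")
print(f"y1 = {y1:.4f}, ln(adjusted volume) = {math.log(adjusted_volume):.4f}")
```

The identity $\hat{y}_1 = \ln(\text{adjusted volume})$ holds for any positive dimensions, since it is only the rule $a\ln u = \ln u^a$ applied term by term.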


Interpretation of the Sample Principal Components 


The sample principal components have several interpretations. First, suppose the
underlying distribution of $\mathbf{X}$ is nearly $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Then the sample principal components,
$\hat{y}_i = \hat{\mathbf{e}}_i'(\mathbf{x} - \bar{\mathbf{x}})$ are realizations of population principal components $Y_i = \mathbf{e}_i'(\mathbf{X} - \boldsymbol{\mu})$,
which have an $N_p(\mathbf{0}, \boldsymbol{\Lambda})$ distribution. The diagonal matrix $\boldsymbol{\Lambda}$ has entries $\lambda_1, \lambda_2, \ldots, \lambda_p$
and $(\lambda_i, \mathbf{e}_i)$ are the eigenvalue-eigenvector pairs of $\boldsymbol{\Sigma}$.

Also, from the sample values $\mathbf{x}_j$, we can approximate $\boldsymbol{\mu}$ by $\bar{\mathbf{x}}$ and $\boldsymbol{\Sigma}$ by $\mathbf{S}$. If $\mathbf{S}$ is
positive definite, the contour consisting of all $p \times 1$ vectors $\mathbf{x}$ satisfying

$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) = c^2 \tag{8-24}$$

estimates the constant density contour $(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = c^2$ of the underlying
normal density. The approximate contours can be drawn on the scatter plot to indicate
the normal distribution that generated the data. The normality assumption is
useful for the inference procedures discussed in Section 8.5, but it is not required
for the development of the properties of the sample principal components summarized
in (8-20).

Even when the normal assumption is suspect and the scatter plot may depart
somewhat from an elliptical pattern, we can still extract the eigenvalues from $\mathbf{S}$ and obtain
the sample principal components. Geometrically, the data may be plotted as $n$
points in $p$-space. The data can then be expressed in the new coordinates, which
coincide with the axes of the contour of (8-24). Now, (8-24) defines a hyperellipsoid
that is centered at $\bar{\mathbf{x}}$ and whose axes are given by the eigenvectors of $\mathbf{S}^{-1}$ or,
equivalently, of $\mathbf{S}$. (See Section 2.3 and Result 4.1, with $\mathbf{S}$ in place of $\boldsymbol{\Sigma}$.) The lengths
of these hyperellipsoid axes are proportional to $\sqrt{\hat{\lambda}_i}$, $i = 1, 2, \ldots, p$, where
$\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_p \ge 0$ are the eigenvalues of $\mathbf{S}$.

Because $\hat{\mathbf{e}}_i$ has length 1, the absolute value of the $i$th principal component,
$|\hat{y}_i| = |\hat{\mathbf{e}}_i'(\mathbf{x} - \bar{\mathbf{x}})|$, gives the length of the projection of the vector $(\mathbf{x} - \bar{\mathbf{x}})$ on the
unit vector $\hat{\mathbf{e}}_i$. [See (2-8) and (2-9).] Thus, the sample principal components
$\hat{y}_i = \hat{\mathbf{e}}_i'(\mathbf{x} - \bar{\mathbf{x}})$, $i = 1, 2, \ldots, p$, lie along the axes of the hyperellipsoid, and their
absolute values are the lengths of the projections of $\mathbf{x} - \bar{\mathbf{x}}$ in the directions of the
axes $\hat{\mathbf{e}}_i$. Consequently, the sample principal components can be viewed as the
result of translating the origin of the original coordinate system to $\bar{\mathbf{x}}$ and then
rotating the coordinate axes until they pass through the scatter in the directions of
maximum variance.

The geometrical interpretation of the sample principal components is illustrated
in Figure 8.4 for $p = 2$. Figure 8.4(a) shows an ellipse of constant distance, centered
at $\bar{\mathbf{x}}$, with $\hat{\lambda}_1 > \hat{\lambda}_2$. The sample principal components are well determined. They
lie along the axes of the ellipse in the perpendicular directions of maximum
sample variance. Figure 8.4(b) shows a constant distance ellipse, centered at $\bar{\mathbf{x}}$, with
$\hat{\lambda}_1 = \hat{\lambda}_2$. If $\hat{\lambda}_1 = \hat{\lambda}_2$, the axes of the ellipse (circle) of constant distance are not
uniquely determined and can lie in any two perpendicular directions, including the
directions of the original coordinate axes. Similarly, the sample principal components
can lie in any two perpendicular directions, including those of the original coordinate
axes. When the contours of constant distance are nearly circular or, equivalently,
when the eigenvalues of $\mathbf{S}$ are nearly equal, the sample variation is homogeneous
in all directions. It is then not possible to represent the data well in fewer than $p$
dimensions.

Figure 8.4 Sample principal components and ellipses of constant distance,
$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) = c^2$, in the $(x_1, x_2)$ plane. (a) $\hat{\lambda}_1 > \hat{\lambda}_2$. (b) $\hat{\lambda}_1 = \hat{\lambda}_2$.

If the last few eigenvalues $\hat{\lambda}_i$ are sufficiently small such that the variation in the
corresponding $\hat{\mathbf{e}}_i$ directions is negligible, the last few sample principal components
can often be ignored, and the data can be adequately approximated by their representations
in the space of the retained components. (See Section 8.4.)

Finally, Supplement 8A gives a further result concerning the role of the sample
principal components when directly approximating the mean-centered data
$\mathbf{x}_j - \bar{\mathbf{x}}$.


Standardizing the Sample Principal Components 


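Sample principal components are, in general, not invariant with respect to changes
in scale. (See Exercises 8.6 and 8.7.) As we mentioned in the treatment of population
components, variables measured on different scales or on a common scale with
widely differing ranges are often standardized. For the sample, standardization is
accomplished by constructing

$$\mathbf{z}_j = \mathbf{D}^{-1/2}(\mathbf{x}_j - \bar{\mathbf{x}}) = \begin{bmatrix} \dfrac{x_{j1} - \bar{x}_1}{\sqrt{s_{11}}} \\ \dfrac{x_{j2} - \bar{x}_2}{\sqrt{s_{22}}} \\ \vdots \\ \dfrac{x_{jp} - \bar{x}_p}{\sqrt{s_{pp}}} \end{bmatrix}, \qquad j = 1, 2, \ldots, n \tag{8-25}$$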


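The $n \times p$ data matrix of standardized observations

$$\mathbf{Z} = \begin{bmatrix} \mathbf{z}_1' \\ \mathbf{z}_2' \\ \vdots \\ \mathbf{z}_n' \end{bmatrix} = \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1p} \\ z_{21} & z_{22} & \cdots & z_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ z_{n1} & z_{n2} & \cdots & z_{np} \end{bmatrix} = \begin{bmatrix} \dfrac{x_{11} - \bar{x}_1}{\sqrt{s_{11}}} & \dfrac{x_{12} - \bar{x}_2}{\sqrt{s_{22}}} & \cdots & \dfrac{x_{1p} - \bar{x}_p}{\sqrt{s_{pp}}} \\ \dfrac{x_{21} - \bar{x}_1}{\sqrt{s_{11}}} & \dfrac{x_{22} - \bar{x}_2}{\sqrt{s_{22}}} & \cdots & \dfrac{x_{2p} - \bar{x}_p}{\sqrt{s_{pp}}} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{x_{n1} - \bar{x}_1}{\sqrt{s_{11}}} & \dfrac{x_{n2} - \bar{x}_2}{\sqrt{s_{22}}} & \cdots & \dfrac{x_{np} - \bar{x}_p}{\sqrt{s_{pp}}} \end{bmatrix} \tag{8-26}$$

yields the sample mean vector [see (3-24)]

$$\bar{\mathbf{z}} = \frac{1}{n}(\mathbf{1}'\mathbf{Z})' = \frac{1}{n}\mathbf{Z}'\mathbf{1} = \frac{1}{n}\begin{bmatrix} \displaystyle\sum_{j=1}^{n}\frac{x_{j1} - \bar{x}_1}{\sqrt{s_{11}}} \\ \displaystyle\sum_{j=1}^{n}\frac{x_{j2} - \bar{x}_2}{\sqrt{s_{22}}} \\ \vdots \\ \displaystyle\sum_{j=1}^{n}\frac{x_{jp} - \bar{x}_p}{\sqrt{s_{pp}}} \end{bmatrix} = \mathbf{0} \tag{8-27}$$

and sample covariance matrix [see (3-27)]

$$\mathbf{S}_z = \frac{1}{n-1}\left(\mathbf{Z} - \frac{1}{n}\mathbf{1}\mathbf{1}'\mathbf{Z}\right)'\left(\mathbf{Z} - \frac{1}{n}\mathbf{1}\mathbf{1}'\mathbf{Z}\right) = \frac{1}{n-1}(\mathbf{Z} - \mathbf{1}\bar{\mathbf{z}}')'(\mathbf{Z} - \mathbf{1}\bar{\mathbf{z}}') = \frac{1}{n-1}\mathbf{Z}'\mathbf{Z}$$

$$= \frac{1}{n-1}\begin{bmatrix} \dfrac{(n-1)s_{11}}{s_{11}} & \dfrac{(n-1)s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} & \cdots & \dfrac{(n-1)s_{1p}}{\sqrt{s_{11}}\sqrt{s_{pp}}} \\ \dfrac{(n-1)s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} & \dfrac{(n-1)s_{22}}{s_{22}} & \cdots & \dfrac{(n-1)s_{2p}}{\sqrt{s_{22}}\sqrt{s_{pp}}} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{(n-1)s_{1p}}{\sqrt{s_{11}}\sqrt{s_{pp}}} & \dfrac{(n-1)s_{2p}}{\sqrt{s_{22}}\sqrt{s_{pp}}} & \cdots & \dfrac{(n-1)s_{pp}}{s_{pp}} \end{bmatrix} = \mathbf{R} \tag{8-28}$$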


The sample principal components of the standardized observations are given by 
(8-20), with the matrix R in place of S. Since the observations are already “centered” 
by construction, there is no need to write the components in the form of (8-21). 
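The chain (8-25) through (8-28) says that standardizing the observations and then computing their sample covariance matrix yields the sample correlation matrix $\mathbf{R}$. A minimal Python sketch of that calculation follows; the data are hypothetical and the function name is chosen only for this illustration.

```python
import math

def sample_cov_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of a list of observation rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(p)]
    return [
        [sum((r[i] - means[i]) * (r[k] - means[k]) for r in rows) / (n - 1)
         for k in range(p)]
        for i in range(p)
    ]

# Hypothetical data: n = 5 observations on p = 3 variables with very different scales.
data = [
    [2.0, 10.0, 1.0],
    [4.0, 30.0, 0.5],
    [6.0, 20.0, 2.5],
    [8.0, 60.0, 2.0],
    [10.0, 80.0, 4.0],
]
n, p = len(data), len(data[0])

xbar = [sum(r[k] for r in data) / n for k in range(p)]
S = sample_cov_matrix(data)

# Standardized observations z_j as in (8-25).
Z = [[(r[k] - xbar[k]) / math.sqrt(S[k][k]) for k in range(p)] for r in data]

zbar = [sum(z[k] for z in Z) / n for k in range(p)]   # (8-27): should be 0
S_z = sample_cov_matrix(Z)                             # (8-28): should equal R

# Sample correlation matrix computed directly from S.
R = [[S[i][k] / math.sqrt(S[i][i] * S[k][k]) for k in range(p)] for i in range(p)]

max_diff = max(abs(S_z[i][k] - R[i][k]) for i in range(p) for k in range(p))
print("zbar =", [round(v, 12) for v in zbar])
print(f"largest |S_z - R| entry = {max_diff:.1e}")
```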


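If $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n$ are standardized observations with covariance matrix $\mathbf{R}$, the $i$th
sample principal component is

$$\hat{y}_i = \hat{\mathbf{e}}_i'\mathbf{z} = \hat{e}_{i1}z_1 + \hat{e}_{i2}z_2 + \cdots + \hat{e}_{ip}z_p, \qquad i = 1, 2, \ldots, p$$

where $(\hat{\lambda}_i, \hat{\mathbf{e}}_i)$ is the $i$th eigenvalue-eigenvector pair of $\mathbf{R}$ with
$\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_p \ge 0$. Also,

$$\text{Sample variance}(\hat{y}_i) = \hat{\lambda}_i, \qquad i = 1, 2, \ldots, p$$
$$\text{Sample covariance}(\hat{y}_i, \hat{y}_k) = 0, \qquad i \neq k \tag{8-29}$$

In addition,

$$\text{Total (standardized) sample variance} = \operatorname{tr}(\mathbf{R}) = p = \hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_p$$

and

$$r_{\hat{y}_i, z_k} = \hat{e}_{ik}\sqrt{\hat{\lambda}_i}, \qquad i, k = 1, 2, \ldots, p$$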


Using (8-29), we see that the proportion of the total sample variance explained
by the $i$th sample principal component is

$$\begin{pmatrix}\text{Proportion of (standardized)}\\ \text{sample variance due to } i\text{th}\\ \text{sample principal component}\end{pmatrix} = \frac{\hat{\lambda}_i}{p}, \qquad i = 1, 2, \ldots, p \tag{8-30}$$

A rule of thumb suggests retaining only those components whose variances $\hat{\lambda}_i$ are
greater than unity or, equivalently, only those components which, individually, explain
at least a proportion $1/p$ of the total variance. This rule does not have a great
deal of theoretical support, however, and it should not be applied blindly. As we
have mentioned, a scree plot is also useful for selecting the appropriate number of
components.


Example 8.5 (Sample principal components from standardized data) The weekly
rates of return for five stocks (JP Morgan, Citibank, Wells Fargo, Royal Dutch Shell,
and ExxonMobil) listed on the New York Stock Exchange were determined for the
period January 2004 through December 2005. The weekly rates of return are
defined as (current week closing price − previous week closing price)/(previous
week closing price), adjusted for stock splits and dividends. The data are listed in
Table 8.4 in the Exercises. The observations in 103 successive weeks appear to be
independently distributed, but the rates of return across stocks are correlated,
because as one expects, stocks tend to move together in response to general
economic conditions.

Let $x_1, x_2, \ldots, x_5$ denote observed weekly rates of return for JP Morgan,
Citibank, Wells Fargo, Royal Dutch Shell, and ExxonMobil, respectively. Then

$$\bar{\mathbf{x}}' = [.0011, .0007, .0016, .0040, .0040]$$


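and

$$\mathbf{R} = \begin{bmatrix} 1.000 & .632 & .511 & .115 & .155 \\ .632 & 1.000 & .574 & .322 & .213 \\ .511 & .574 & 1.000 & .183 & .146 \\ .115 & .322 & .183 & 1.000 & .683 \\ .155 & .213 & .146 & .683 & 1.000 \end{bmatrix}$$

We note that $\mathbf{R}$ is the covariance matrix of the standardized observations

$$z_1 = \frac{x_1 - \bar{x}_1}{\sqrt{s_{11}}}, \quad z_2 = \frac{x_2 - \bar{x}_2}{\sqrt{s_{22}}}, \quad \ldots, \quad z_5 = \frac{x_5 - \bar{x}_5}{\sqrt{s_{55}}}$$

The eigenvalues and corresponding normalized eigenvectors of $\mathbf{R}$, determined by a
computer, are

$$\hat{\lambda}_1 = 2.437, \quad \hat{\mathbf{e}}_1' = [.469, .532, .465, .387, .361]$$
$$\hat{\lambda}_2 = 1.407, \quad \hat{\mathbf{e}}_2' = [-.368, -.236, -.315, .585, .606]$$
$$\hat{\lambda}_3 = .501, \quad \hat{\mathbf{e}}_3' = [-.604, -.136, .772, .093, -.109]$$
$$\hat{\lambda}_4 = .400, \quad \hat{\mathbf{e}}_4' = [.363, -.629, .289, -.381, .493]$$
$$\hat{\lambda}_5 = .255, \quad \hat{\mathbf{e}}_5' = [.384, -.496, .071, .595, -.498]$$

Using the standardized variables, we obtain the first two sample principal
components:

$$\hat{y}_1 = \hat{\mathbf{e}}_1'\mathbf{z} = .469z_1 + .532z_2 + .465z_3 + .387z_4 + .361z_5$$
$$\hat{y}_2 = \hat{\mathbf{e}}_2'\mathbf{z} = -.368z_1 - .236z_2 - .315z_3 + .585z_4 + .606z_5$$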

These components, which account for

$$\left(\frac{\hat{\lambda}_1 + \hat{\lambda}_2}{p}\right)100\% = \left(\frac{2.437 + 1.407}{5}\right)100\% = 77\%$$

of the total (standardized) sample variance, have interesting interpretations. The
first component is a roughly equally weighted sum, or "index," of the five stocks.
This component might be called a general stock-market component, or, simply, a
market component.

The second component represents a contrast between the banking stocks
(JP Morgan, Citibank, Wells Fargo) and the oil stocks (Royal Dutch Shell, ExxonMobil).
It might be called an industry component. Thus, we see that most of the
variation in these stock returns is due to market activity and uncorrelated industry
activity. This interpretation of stock price behavior also has been suggested by
King [12].

The remaining components are not easy to interpret and, collectively, represent
variation that is probably specific to each stock. In any event, they do not explain
much of the total sample variance. ■
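The proportions in (8-30) and the eigenvalue-greater-than-one rule of thumb can be applied mechanically to the eigenvalues reported in Example 8.5. The short Python sketch below does so. The variable names are illustrative, and the rule itself should be read with the caution expressed in the text.

```python
from itertools import accumulate

# Eigenvalues of R reported in Example 8.5.
eigenvalues = [2.437, 1.407, 0.501, 0.400, 0.255]
p = len(eigenvalues)

# For standardized variables the total sample variance is tr(R) = p.
proportions = [lam / p for lam in eigenvalues]
cumulative = list(accumulate(proportions))

# Rule of thumb: retain components whose variance exceeds unity.
retained = [i + 1 for i, lam in enumerate(eigenvalues) if lam > 1.0]

for i, (lam, prop, cum) in enumerate(zip(eigenvalues, proportions, cumulative), 1):
    print(f"component {i}: variance {lam:.3f}, proportion {prop:.3f}, cumulative {cum:.3f}")
print("retained by the rule of thumb:", retained)
```

Here the rule of thumb and the interpretation given in the example agree on two components; that agreement is a feature of these data, not a general property.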


Example 8.6 (Components from a correlation matrix with a special structure) Geneticists
are often concerned with the inheritance of characteristics that can be measured
several times during an animal's lifetime. Body weight (in grams) for $n = 150$
female mice were obtained immediately after the birth of their first four litters.⁴
The sample mean vector and sample correlation matrix were, respectively,

$$\bar{\mathbf{x}}' = [39.88, 45.08, 48.11, 49.95]$$

and

$$\mathbf{R} = \begin{bmatrix} 1.000 & .7501 & .6329 & .6363 \\ .7501 & 1.000 & .6925 & .7386 \\ .6329 & .6925 & 1.000 & .6625 \\ .6363 & .7386 & .6625 & 1.000 \end{bmatrix}$$

The eigenvalues of this matrix are

$$\hat{\lambda}_1 = 3.058, \quad \hat{\lambda}_2 = .382, \quad \hat{\lambda}_3 = .342, \quad \text{and} \quad \hat{\lambda}_4 = .217$$

We note that the first eigenvalue is nearly equal to $1 + (p - 1)\bar{r} = 1 + (4 - 1)(.6854)
= 3.056$, where $\bar{r}$ is the arithmetic average of the off-diagonal elements of $\mathbf{R}$. The
remaining eigenvalues are small and about equal, although $\hat{\lambda}_4$ is somewhat smaller
than $\hat{\lambda}_2$ and $\hat{\lambda}_3$. Thus, there is some evidence that the corresponding population
correlation matrix $\boldsymbol{\rho}$ may be of the "equal-correlation" form of (8-15). This notion
is explored further in Example 8.9.

The first principal component

$$\hat{y}_1 = \hat{\mathbf{e}}_1'\mathbf{z} = .49z_1 + .52z_2 + .49z_3 + .50z_4$$

accounts for $100(\hat{\lambda}_1/p)\% = 100(3.058/4)\% = 76\%$ of the total variance. Although
the average postbirth weights increase over time, the variation in weights is fairly
well explained by the first principal component with (nearly) equal coefficients. ■
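The comparison with the equal-correlation structure in Example 8.6 amounts to a short calculation, sketched below in Python. It averages the off-diagonal elements of $\mathbf{R}$ and forms $1 + (p - 1)\bar{r}$ and $1 - \bar{r}$, the two eigenvalues that (8-16) and the line following it would give if every correlation equaled $\bar{r}$. The variable names are illustrative.

```python
# Sample correlation matrix from Example 8.6.
R = [
    [1.000, 0.7501, 0.6329, 0.6363],
    [0.7501, 1.000, 0.6925, 0.7386],
    [0.6329, 0.6925, 1.000, 0.6625],
    [0.6363, 0.7386, 0.6625, 1.000],
]
p = len(R)

# Arithmetic average of the off-diagonal elements.
off_diagonal = [R[i][k] for i in range(p) for k in range(p) if i != k]
r_bar = sum(off_diagonal) / len(off_diagonal)

# Eigenvalues implied by an exact equal-correlation matrix with rho = r_bar.
largest_implied = 1.0 + (p - 1) * r_bar
remaining_implied = 1.0 - r_bar

# Average of the three smaller eigenvalues reported in the example.
reported_small = [0.382, 0.342, 0.217]
average_small = sum(reported_small) / len(reported_small)

print(f"r_bar = {r_bar:.4f}")
print(f"1 + (p - 1) r_bar = {largest_implied:.3f}")
print(f"1 - r_bar = {remaining_implied:.3f}, average of smaller eigenvalues = {average_small:.3f}")
```

The average of the smaller reported eigenvalues is close to $1 - \bar{r}$ even though the individual values differ, which matches the text's remark that $\hat{\lambda}_4$ is somewhat smaller than the other two.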


Comment. An unusually small value for the last eigenvalue from either the sample
covariance or correlation matrix can indicate an unnoticed linear dependency in
the data set. If this occurs, one (or more) of the variables is redundant and should
be deleted. Consider a situation where $x_1$, $x_2$, and $x_3$ are subtest scores and the
total score $x_4$ is the sum $x_1 + x_2 + x_3$. Then, although the linear combination
$\mathbf{e}'\mathbf{x} = [1, 1, 1, -1]\mathbf{x} = x_1 + x_2 + x_3 - x_4$ is always zero, rounding error in the
computation of eigenvalues may lead to a small nonzero value. If the linear
expression relating $x_4$ to $(x_1, x_2, x_3)$ was initially overlooked, the smallest
eigenvalue-eigenvector pair should provide a clue to its existence. (See the discussion
in Section 3.4, pages 131-133.)

Thus, although “large” eigenvalues and the corresponding eigenvectors are im- 
portant in a principal component analysis, eigenvalues very close to zero should not 
be routinely ignored. The eigenvectors associated with these latter eigenvalues may 
point out linear dependencies in the data set that can cause interpretive and compu- 
tational problems in a subsequent analysis. 
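A minimal sketch of the subtest-score situation follows. The scores are hypothetical, made up only for illustration. It shows that $\mathbf{S}\mathbf{e} = \mathbf{0}$ for $\mathbf{e} = [1, 1, 1, -1]'$, so this direction is an eigenvector of $\mathbf{S}$ with eigenvalue zero, up to rounding error.

```python
# Hypothetical subtest scores (x1, x2, x3); the total x4 is their sum.
subtests = [(12, 15, 9), (10, 11, 14), (14, 9, 13), (8, 12, 10), (11, 14, 12)]
data = [(a, b, c, a + b + c) for a, b, c in subtests]

n, p = len(data), 4
means = [sum(row[i] for row in data) / n for i in range(p)]

# Sample covariance matrix S with divisor n - 1.
S = [
    [
        sum((row[i] - means[i]) * (row[k] - means[k]) for row in data) / (n - 1)
        for k in range(p)
    ]
    for i in range(p)
]

# The dependency x1 + x2 + x3 - x4 = 0 corresponds to e = [1, 1, 1, -1]'.
e = [1, 1, 1, -1]
Se = [sum(S[i][k] * e[k] for k in range(p)) for i in range(p)]

# Every entry of S e is zero apart from floating-point rounding.
print(Se)
```

In practice the dependency is not known in advance; an eigenvalue routine would return a smallest eigenvalue near zero, and its eigenvector, proportional to $[1, 1, 1, -1]'$, would reveal the relation.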


‘Data courtesy of J.J. Rutledge. 


454 Chapter 8 Principal Components 


8.4 Graphing the Principal Components 


Plots of the principal components can reveal suspect observations, as well as provide
checks on the assumption of normality. Since the principal components are linear
combinations of the original variables, it is not unreasonable to expect them to be
nearly normal. It is often necessary to verify that the first few principal components
are approximately normally distributed when they are to be used as the input data
for additional analyses.

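The last principal components can help pinpoint suspect observations. Each
observation can be expressed as a linear combination

$$\begin{aligned}
\mathbf{x}_j &= (\mathbf{x}_j'\hat{\mathbf{e}}_1)\hat{\mathbf{e}}_1 + (\mathbf{x}_j'\hat{\mathbf{e}}_2)\hat{\mathbf{e}}_2 + \cdots + (\mathbf{x}_j'\hat{\mathbf{e}}_p)\hat{\mathbf{e}}_p \\
&= \hat y_{j1}\hat{\mathbf{e}}_1 + \hat y_{j2}\hat{\mathbf{e}}_2 + \cdots + \hat y_{jp}\hat{\mathbf{e}}_p
\end{aligned}$$

of the complete set of eigenvectors $\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_p$
of $\mathbf{S}$. Thus, the magnitudes of the last principal components determine how well
the first few fit the observations. That is,
$\hat y_{j1}\hat{\mathbf{e}}_1 + \hat y_{j2}\hat{\mathbf{e}}_2 + \cdots + \hat y_{j,q-1}\hat{\mathbf{e}}_{q-1}$
differs from $\mathbf{x}_j$ by $\hat y_{jq}\hat{\mathbf{e}}_q + \cdots + \hat y_{jp}\hat{\mathbf{e}}_p$,
the square of whose length is $\hat y_{jq}^2 + \cdots + \hat y_{jp}^2$. Suspect observations
will often be such that at least one of the coordinates $\hat y_{jq}, \ldots, \hat y_{jp}$
contributing to this squared length will be large.
(See Supplement 8A for more general approximation results.)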

The following statements summarize these ideas. 


1. To help check the normal assumption, construct scatter diagrams for pairs of the 
first few principal components. Also, make Q-Q plots from the sample values 
generated by each principal component. 

2. Construct scatter diagrams and Q-Q plots for the last few principal compo- 
nents. These help identify suspect observations. 


Example 8.7 (Plotting the principal components for the turtle data) We illustrate
the plotting of principal components for the data on male turtles discussed in
Example 8.4. The three sample principal components are

$$\begin{aligned}
\hat y_1 &= .683(x_1 - 4.725) + .510(x_2 - 4.478) + .523(x_3 - 3.703) \\
\hat y_2 &= -.159(x_1 - 4.725) - .594(x_2 - 4.478) + .788(x_3 - 3.703) \\
\hat y_3 &= -.713(x_1 - 4.725) + .622(x_2 - 4.478) + .324(x_3 - 3.703)
\end{aligned}$$

where $x_1 = \ln(\text{length})$, $x_2 = \ln(\text{width})$, and $x_3 = \ln(\text{height})$, respectively.

Figure 8.5 shows the Q-Q plot for $\hat y_2$ and Figure 8.6 shows the scatter plot of
$(\hat y_1, \hat y_2)$. The observation for the first turtle is circled and lies in the
lower right corner of the scatter plot and in the upper right corner of the Q-Q plot;
it may be suspect. This point should have been checked for recording errors, or the
turtle should have been examined for structural anomalies. Apart from the first turtle,
the scatter plot appears to be reasonably elliptical. The plots for the other sets of
principal components do not indicate any substantial departures from normality. ■




Figure 8.5 A Q-Q plot for the second principal component $\hat y_2$ from
the data on male turtles.


Figure 8.6 Scatter plot of the principal components $\hat y_1$ and $\hat y_2$ of the
data on male turtles.


The diagnostics involving principal components apply equally well to the 
checking of assumptions for a multivariate multiple regression model. In fact, 
having fit any model by any method of estimation, it is prudent to consider the 


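$$\text{Residual vector} = (\text{observation vector}) - \begin{pmatrix}\text{vector of predicted} \\ \text{(estimated) values}\end{pmatrix}$$

or

$$\hat{\boldsymbol\varepsilon}_j = \mathbf{y}_j - \hat{\boldsymbol\beta}'\mathbf{z}_j, \qquad j = 1, 2, \ldots, n \tag{8-31}$$

for the multivariate linear model, where each of the three vectors is $p \times 1$.
Principal components, derived from the covariance matrix of the residuals,

$$\frac{1}{n - p}\sum_{j=1}^{n} (\hat{\boldsymbol\varepsilon}_j - \bar{\hat{\boldsymbol\varepsilon}})(\hat{\boldsymbol\varepsilon}_j - \bar{\hat{\boldsymbol\varepsilon}})' \tag{8-32}$$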


can be scrutinized in the same manner as those determined from a random 
sample. You should be aware that there are linear dependencies among the residuals 


from a linear regression analysis, so the last eigenvalues will be zero, within round- 
ing error. 




8.5 Large Sample Inferences 


We have seen that the eigenvalues and eigenvectors of the covariance (correlation)
matrix are the essence of a principal component analysis. The eigenvectors determine
the directions of maximum variability, and the eigenvalues specify the variances.
When the first few eigenvalues are much larger than the rest, most of the total
variance can be "explained" in fewer than $p$ dimensions.

In practice, decisions regarding the quality of the principal component
approximation must be made on the basis of the eigenvalue-eigenvector
pairs $(\hat\lambda_i, \hat{\mathbf{e}}_i)$ extracted from $\mathbf{S}$ or $\mathbf{R}$.
Because of sampling variation, these eigenvalues and eigenvectors will differ from their
underlying population counterparts. The sampling distributions of $\hat\lambda_i$ and
$\hat{\mathbf{e}}_i$ are difficult to derive and beyond the scope of this book. If you are
interested, you can find some of these derivations for multivariate normal populations
in [1], [2], and [5]. We shall simply summarize the pertinent large sample results.


Large Sample Properties of $\hat\lambda_i$ and $\hat{\mathbf{e}}_i$


Currently available results concerning large sample confidence intervals for
$\hat\lambda_i$ and $\hat{\mathbf{e}}_i$ assume that the observations
$\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ are a random sample from a normal
population. It must also be assumed that the (unknown) eigenvalues of $\boldsymbol\Sigma$
are distinct and positive, so that $\lambda_1 > \lambda_2 > \cdots > \lambda_p > 0$.
The one exception is the case where the number of equal eigenvalues is known. Usually
the conclusions for distinct eigenvalues are applied, unless there is a strong reason to
believe that $\boldsymbol\Sigma$ has a special structure that yields equal eigenvalues.
Even when the normal assumption is violated, the confidence intervals obtained in this
manner still provide some indication of the uncertainty in $\hat\lambda_i$ and
$\hat{\mathbf{e}}_i$.

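Anderson [2] and Girshick [5] have established the following large sample distribution
theory for the eigenvalues $\hat{\boldsymbol\lambda}' = [\hat\lambda_1, \ldots, \hat\lambda_p]$
and eigenvectors $\hat{\mathbf{e}}_1, \ldots, \hat{\mathbf{e}}_p$ of $\mathbf{S}$:

1. Let $\boldsymbol\Lambda$ be the diagonal matrix of eigenvalues $\lambda_1, \ldots, \lambda_p$
of $\boldsymbol\Sigma$, then $\sqrt{n}\,(\hat{\boldsymbol\lambda} - \boldsymbol\lambda)$
is approximately $N_p(\mathbf{0}, 2\boldsymbol\Lambda^2)$.
2. Let

$$\mathbf{E}_i = \lambda_i \sum_{\substack{k=1 \\ k \neq i}}^{p} \frac{\lambda_k}{(\lambda_k - \lambda_i)^2}\,\mathbf{e}_k\mathbf{e}_k'$$

then $\sqrt{n}\,(\hat{\mathbf{e}}_i - \mathbf{e}_i)$ is approximately $N_p(\mathbf{0}, \mathbf{E}_i)$.
3. Each $\hat\lambda_i$ is distributed independently of the elements of the associated $\hat{\mathbf{e}}_i$.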


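Result 1 implies that, for $n$ large, the $\hat\lambda_i$ are independently distributed.
Moreover, $\hat\lambda_i$ has an approximate $N(\lambda_i, 2\lambda_i^2/n)$ distribution.
Using this normal distribution, we obtain
$P[\,|\hat\lambda_i - \lambda_i| \le z(\alpha/2)\lambda_i\sqrt{2/n}\,] = 1 - \alpha$.
A large sample $100(1 - \alpha)\%$ confidence interval for $\lambda_i$ is thus provided by

$$\frac{\hat\lambda_i}{1 + z(\alpha/2)\sqrt{2/n}} \le \lambda_i \le \frac{\hat\lambda_i}{1 - z(\alpha/2)\sqrt{2/n}} \tag{8-33}$$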




where $z(\alpha/2)$ is the upper $100(\alpha/2)$th percentile of a standard normal
distribution. Bonferroni-type simultaneous $100(1 - \alpha)\%$ intervals for $m$
$\lambda_i$'s are obtained by replacing $z(\alpha/2)$ with $z(\alpha/2m)$. (See Section 5.4.)

Result 2 implies that the $\hat{\mathbf{e}}_i$'s are normally distributed about the
corresponding $\mathbf{e}_i$'s for large samples. The elements of each $\hat{\mathbf{e}}_i$
are correlated, and the correlation depends to a large extent on the separation of the
eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ (which is unknown) and the sample
size $n$. Approximate standard errors for the coefficients $\hat e_{ik}$ are given by the
square roots of the diagonal elements of $(1/n)\hat{\mathbf{E}}_i$, where
$\hat{\mathbf{E}}_i$ is derived from $\mathbf{E}_i$ by substituting $\hat\lambda_i$'s for
the $\lambda_i$'s and $\hat{\mathbf{e}}_i$'s for the $\mathbf{e}_i$'s.


Example 8.8 (Constructing a confidence interval for $\lambda_1$) We shall obtain a 95%
confidence interval for $\lambda_1$, the variance of the first population principal
component, using the stock price data listed in Table 8.4 in the Exercises.

Assume that the stock rates of return represent independent drawings from
an $N_5(\boldsymbol\mu, \boldsymbol\Sigma)$ population, where $\boldsymbol\Sigma$ is
positive definite with distinct eigenvalues
$\lambda_1 > \lambda_2 > \cdots > \lambda_5 > 0$. Since $n = 103$ is large, we can use
(8-33) with $i = 1$ to construct a 95% confidence interval for $\lambda_1$. From
Exercise 8.10, $\hat\lambda_1 = .0014$ and in addition, $z(.025) = 1.96$. Therefore,
with 95% confidence,

$$\frac{.0014}{1 + 1.96\sqrt{2/103}} \le \lambda_1 \le \frac{.0014}{1 - 1.96\sqrt{2/103}} \qquad\text{or}\qquad .0011 \le \lambda_1 \le .0019$$ ■


Whenever an eigenvalue is large, such as 100 or even 1000, the intervals generated
by (8-33) can be quite wide, for reasonable confidence levels, even though $n$ is
fairly large. In general, the confidence interval gets wider at the same rate that
$\hat\lambda_i$ gets larger. Consequently, some care must be exercised in dropping or
retaining principal components based on an examination of the $\hat\lambda_i$'s.
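The interval (8-33) is simple to compute. The sketch below reproduces the numbers of Example 8.8 using only the Python standard library; the function name `eigenvalue_ci` and its argument `m` (the number of intervals in a Bonferroni adjustment) are illustrative choices, not notation from the text.

```python
from math import sqrt
from statistics import NormalDist


def eigenvalue_ci(lam_hat, n, alpha=0.05, m=1):
    """Large sample interval (8-33) for a population eigenvalue.

    lam_hat: sample eigenvalue; n: sample size;
    m: number of simultaneous intervals (Bonferroni uses z(alpha / 2m)).
    """
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))  # upper percentile z(alpha/2m)
    half = z * sqrt(2.0 / n)
    if half >= 1:
        raise ValueError("n is too small for the large sample interval")
    return lam_hat / (1 + half), lam_hat / (1 - half)


# Example 8.8: lambda_1 hat = .0014 with n = 103 observations.
lower, upper = eigenvalue_ci(0.0014, 103)
print(f"{lower:.4f} <= lambda_1 <= {upper:.4f}")
```

Because both limits are multiples of $\hat\lambda_i$, the width of the interval is proportional to $\hat\lambda_i$, which is the point made in the paragraph above.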


Testing for the Equal Correlation Structure 


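The special correlation structure $\operatorname{Cov}(X_i, X_k) = \sqrt{\sigma_{ii}\sigma_{kk}}\,\rho$,
or $\operatorname{Corr}(X_i, X_k) = \rho$, all $i \neq k$, is one important structure in
which the eigenvalues of $\boldsymbol\Sigma$ are not distinct and the previous results
do not apply.

To test for this structure, let

$$H_0: \boldsymbol\rho = \boldsymbol\rho_0 = \begin{bmatrix}
1 & \rho & \cdots & \rho \\
\rho & 1 & \cdots & \rho \\
\vdots & \vdots & \ddots & \vdots \\
\rho & \rho & \cdots & 1
\end{bmatrix} \quad (p \times p)$$

and

$$H_1: \boldsymbol\rho \neq \boldsymbol\rho_0$$

A test of $H_0$ versus $H_1$ may be based on a likelihood ratio statistic, but Lawley [14]
has demonstrated that an equivalent test procedure can be constructed from the
off-diagonal elements of $\mathbf{R}$.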




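Lawley's procedure requires the quantities

$$\bar r_k = \frac{1}{p - 1}\sum_{\substack{i=1 \\ i \neq k}}^{p} r_{ik}, \quad k = 1, 2, \ldots, p; \qquad
\bar r = \frac{2}{p(p - 1)}\sum\sum_{i<k} r_{ik}$$

$$\hat\gamma = \frac{(p - 1)^2\,[1 - (1 - \bar r)^2]}{p - (p - 2)(1 - \bar r)^2} \tag{8-34}$$

It is evident that $\bar r_k$ is the average of the off-diagonal elements in the $k$th
column (or row) of $\mathbf{R}$ and $\bar r$ is the overall average of the off-diagonal
elements.

The large sample approximate $\alpha$-level test is to reject $H_0$ in favor of $H_1$ if

$$T = \frac{(n - 1)}{(1 - \bar r)^2}\left[\sum\sum_{i<k} (r_{ik} - \bar r)^2 - \hat\gamma \sum_{k=1}^{p} (\bar r_k - \bar r)^2\right] > \chi^2_{(p+1)(p-2)/2}(\alpha) \tag{8-35}$$

where $\chi^2_{(p+1)(p-2)/2}(\alpha)$ is the upper $(100\alpha)$th percentile of a
chi-square distribution with $(p + 1)(p - 2)/2$ d.f.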


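Example 8.9 (Testing for equicorrelation structure) From Example 8.6, the sample
correlation matrix constructed from the $n = 150$ post-birth weights of female
mice is

$$\mathbf{R} = \begin{bmatrix}
1.0 & .7501 & .6329 & .6363 \\
.7501 & 1.0 & .6925 & .7386 \\
.6329 & .6925 & 1.0 & .6625 \\
.6363 & .7386 & .6625 & 1.0
\end{bmatrix}$$

We shall use this correlation matrix to illustrate the large sample test in (8-35).
Here $p = 4$, and we set

$$H_0: \boldsymbol\rho = \boldsymbol\rho_0 = \begin{bmatrix}
1 & \rho & \rho & \rho \\
\rho & 1 & \rho & \rho \\
\rho & \rho & 1 & \rho \\
\rho & \rho & \rho & 1
\end{bmatrix}, \qquad H_1: \boldsymbol\rho \neq \boldsymbol\rho_0$$

Using (8-34) and (8-35), we obtain

$$\bar r_1 = \tfrac{1}{3}(.7501 + .6329 + .6363) = .6731, \quad \bar r_2 = .7271, \quad \bar r_3 = .6626, \quad \bar r_4 = .6791$$

$$\bar r = \frac{2}{4(3)}(.7501 + .6329 + .6363 + .6925 + .7386 + .6625) = .6855$$

$$\sum\sum_{i<k} (r_{ik} - \bar r)^2 = (.7501 - .6855)^2 + (.6329 - .6855)^2 + \cdots + (.6625 - .6855)^2 = .01277$$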



$$\sum_{k=1}^{4} (\bar r_k - \bar r)^2 = (.6731 - .6855)^2 + \cdots + (.6791 - .6855)^2 = .00245$$

$$\hat\gamma = \frac{(4 - 1)^2\,[1 - (1 - .6855)^2]}{4 - (4 - 2)(1 - .6855)^2} = 2.1329$$

and

$$T = \frac{(150 - 1)}{(1 - .6855)^2}\,[.01277 - (2.1329)(.00245)] = 11.4$$

Since $(p + 1)(p - 2)/2 = 5(2)/2 = 5$, the 5% critical value for the test in (8-35) is
$\chi^2_5(.05) = 11.07$. The value of our test statistic is approximately equal to the
large sample 5% critical point, so the evidence against $H_0$ (equal correlations) is
strong, but not overwhelming.

As we saw in Example 8.6, the smallest eigenvalues $\hat\lambda_2$, $\hat\lambda_3$, and
$\hat\lambda_4$ are slightly different, with $\hat\lambda_4$ being somewhat smaller than
the other two. Consequently, with the large sample size in this problem, small
differences from the equal correlation structure show up as statistically significant. ■
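The computations in this example can be sketched as a small function implementing (8-34) and (8-35). The function name `lawley_statistic` is an illustrative choice, and the 5% critical value 11.07 is taken from the text rather than computed, since the standard library has no chi-square percentile function.

```python
def lawley_statistic(R, n):
    """Lawley's statistic T of (8-35) for a p x p sample correlation matrix R."""
    p = len(R)
    # r_bar_k: average of the off-diagonal elements in column k of R.
    col_means = [
        sum(R[i][k] for i in range(p) if i != k) / (p - 1) for k in range(p)
    ]
    # r_bar: overall average of the off-diagonal elements (pairs with i < k).
    pairs = [R[i][k] for i in range(p) for k in range(i + 1, p)]
    r_bar = sum(pairs) / len(pairs)
    # gamma hat of (8-34).
    gamma = (p - 1) ** 2 * (1 - (1 - r_bar) ** 2) / (p - (p - 2) * (1 - r_bar) ** 2)
    ss_pairs = sum((r - r_bar) ** 2 for r in pairs)
    ss_cols = sum((rk - r_bar) ** 2 for rk in col_means)
    T = (n - 1) / (1 - r_bar) ** 2 * (ss_pairs - gamma * ss_cols)
    df = (p + 1) * (p - 2) // 2  # (p + 1)(p - 2) is always even
    return T, df, gamma, r_bar


R = [
    [1.0000, 0.7501, 0.6329, 0.6363],
    [0.7501, 1.0000, 0.6925, 0.7386],
    [0.6329, 0.6925, 1.0000, 0.6625],
    [0.6363, 0.7386, 0.6625, 1.0000],
]
T, df, gamma, r_bar = lawley_statistic(R, n=150)

critical_value = 11.07  # chi-square upper 5% point with 5 d.f., from the text
print(round(T, 2), df, T > critical_value)
```

Carrying full precision through the intermediate sums gives a value of $T$ slightly below the rounded 11.4 reported above, still just beyond the 5% critical point.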


Assuming a multivariate normal population, a large sample test that all variables
are independent (all the off-diagonal elements of $\boldsymbol\Sigma$ are zero) is
contained in Exercise 8.9.


8.6 Monitoring Quality with Principal Components 


In Section 5.6, we introduced multivariate control charts, including the quality ellipse 
and the T? chart. Today, with electronic and other automated methods of data collec- 
tion, it is not uncommon for data to be collected on 10 or 20 process variables. Major 
chemical and drug companies report measuring over 100 process variables, including 
temperature, pressure, concentration, and weight, at various positions along the pro- 
duction process. Even with 10 variables to monitor, there are 45 pairs for which to cre- 
ate quality ellipses. Clearly, another approach is required to both visually display 
important quantities and still have the sensitivity to detect special causes of variation. 


Checking a Given Set of Measurements for Stability 


Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from a
multivariate normal distribution with mean $\boldsymbol\mu$ and covariance matrix
$\boldsymbol\Sigma$. We consider the first two sample principal components,
$\hat y_{j1} = \hat{\mathbf{e}}_1'(\mathbf{x}_j - \bar{\mathbf{x}})$ and
$\hat y_{j2} = \hat{\mathbf{e}}_2'(\mathbf{x}_j - \bar{\mathbf{x}})$. Additional principal
components could be considered, but two are easier to inspect visually and, of any two
components, the first two explain the largest cumulative proportion of the total sample
variance.

If a process is stable over time, so that the measured characteristics are influenced
only by variations in common causes, then the values of the first two principal
components should be stable. Conversely, if the principal components remain stable
over time, the common effects that influence the process are likely to remain constant.
To monitor quality using principal components, we consider a two-part procedure. The
first part of the procedure is to construct an ellipse format chart for the
pairs of values $(\hat y_{j1}, \hat y_{j2})$ for $j = 1, 2, \ldots, n$.




By (8-20), the sample variance of the first principal component $\hat y_1$ is given by
the largest eigenvalue $\hat\lambda_1$, and the sample variance of the second principal
component $\hat y_2$ is the second-largest eigenvalue $\hat\lambda_2$. The two sample
components are uncorrelated, so the quality ellipse for $n$ large (see Section 5.6)
reduces to the collection of pairs of possible values $(\hat y_1, \hat y_2)$ such that

$$\frac{\hat y_1^2}{\hat\lambda_1} + \frac{\hat y_2^2}{\hat\lambda_2} \le \chi^2_2(\alpha) \tag{8-36}$$


Example 8.10 (An ellipse format chart based on the first two principal components)
Refer to the police department overtime data given in Table 5.8. Table 8.1 contains
the five normalized eigenvectors and eigenvalues of the sample covariance matrix $\mathbf{S}$.

The first two sample components explain 82% of the total variance.
The sample values for all five components are displayed in Table 8.2.


Table 8.1 Eigenvectors and Eigenvalues of Sample Covariance Matrix for
Police Department Data

Variable                     $\hat{e}_1$   $\hat{e}_2$   $\hat{e}_3$   $\hat{e}_4$   $\hat{e}_5$
Appearances overtime (x1)        .046         -.048         .629         -.643         .432
Extraordinary event (x2)         .039          .985        -.077         -.151        -.007
Holdover hours (x3)             -.658          .107         .582          .250        -.392
COA hours (x4)                   .734          .069         .503          .397        -.213
Meeting hours (x5)              -.155          .107         .081          .586         .784
$\hat\lambda_i$             2,770,226     1,429,206      628,129       221,138       99,824


Table 8.2 Values of the Principal Components for
the Police Department Data

Period   $\hat y_{j1}$   $\hat y_{j2}$   $\hat y_{j3}$   $\hat y_{j4}$   $\hat y_{j5}$
  1         2044.9           588.2           425.8          -189.1          -209.8
  2        -2143.7          -686.2           883.6          -565.9          -441.5
  3         -177.8          -464.6           707.5           736.3            38.2
  4        -2186.2           450.5          -184.0           443.7          -325.3
  5         -878.6          -545.7           115.7           296.4           437.5
  6          563.2         -1045.4           281.2           620.5           142.7
  7          403.1            66.8           340.6          -135.5           521.2
  8        -1988.9          -801.8         -1437.3          -148.8            61.6
  9          132.8           563.7           125.3            68.2           611.5
 10        -2787.3          -213.4             7.8           169.4          -202.3
 11          283.4          3936.9            -0.9           276.2          -159.6
 12          761.6           256.0         -2153.6          -418.8            28.2
 13         -498.3           244.7           966.5         -1142.3           182.6
 14         2366.2         -1193.7          -165.5           270.6          -344.9
 15         1917.8          -782.0           -82.9          -196.8           -89.9
 16         2187.7          -373.8           170.1           -84.1          -250.2

Figure 8.7 The 95% control ellipse based on the first two principal
components of overtime hours.

Let us construct a 95% ellipse format chart using the first two sample principal
components and plot the 16 pairs of component values in Table 8.2.
Although $n = 16$ is not large, we use $\chi^2_2(.05) = 5.99$, and the ellipse becomes

$$\frac{\hat y_1^2}{\hat\lambda_1} + \frac{\hat y_2^2}{\hat\lambda_2} \le 5.99$$

This ellipse, centered at (0, 0), is shown in Figure 8.7, along with the data.

One point is out of control, because the second principal component for this
point has a large value. Scanning Table 8.2, we see that this is the value 3936.9 for
period 11. According to the entries of $\hat{\mathbf{e}}_2$ in Table 8.1, the second
principal component is essentially extraordinary event overtime hours. The principal
component approach has led us to the same conclusion we came to in Example 5.9. ■
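The out-of-control point can also be found numerically rather than by eye. The sketch below evaluates the left side of (8-36) for each period, using the first two columns of Table 8.2 and the two largest eigenvalues from Table 8.1; the variable names are illustrative.

```python
# Two largest eigenvalues of S (Table 8.1).
lambda1, lambda2 = 2_770_226, 1_429_206

# First two principal component values for periods 1..16 (Table 8.2).
components = [
    (2044.9, 588.2), (-2143.7, -686.2), (-177.8, -464.6), (-2186.2, 450.5),
    (-878.6, -545.7), (563.2, -1045.4), (403.1, 66.8), (-1988.9, -801.8),
    (132.8, 563.7), (-2787.3, -213.4), (283.4, 3936.9), (761.6, 256.0),
    (-498.3, 244.7), (2366.2, -1193.7), (1917.8, -782.0), (2187.7, -373.8),
]

chi2_2_05 = 5.99  # upper 5% point of chi-square with 2 d.f., from the text

# Squared statistical distance of each pair from the origin.
distances = [y1 ** 2 / lambda1 + y2 ** 2 / lambda2 for y1, y2 in components]

out_of_control = [
    period for period, d in enumerate(distances, start=1) if d > chi2_2_05
]
print(out_of_control)
```

Only period 11 falls outside the ellipse, in agreement with Figure 8.7.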


In the event that special causes are likely to produce shocks to the system, the 
second part of our two-part procedure—that is, a second chart—is required. This 
chart is created from the information in the principal components not involved in 
the ellipse format chart. 

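Consider the deviation vector $\mathbf{X} - \boldsymbol\mu$, and assume that $\mathbf{X}$
is distributed as $N_p(\boldsymbol\mu, \boldsymbol\Sigma)$. Even without the normal
assumption, $\mathbf{X} - \boldsymbol\mu$ can be expressed as the sum of its projections
on the eigenvectors of $\boldsymbol\Sigma$

$$\mathbf{X} - \boldsymbol\mu = (\mathbf{X} - \boldsymbol\mu)'\mathbf{e}_1\mathbf{e}_1 + (\mathbf{X} - \boldsymbol\mu)'\mathbf{e}_2\mathbf{e}_2 + (\mathbf{X} - \boldsymbol\mu)'\mathbf{e}_3\mathbf{e}_3 + \cdots + (\mathbf{X} - \boldsymbol\mu)'\mathbf{e}_p\mathbf{e}_p$$

or

$$\mathbf{X} - \boldsymbol\mu = Y_1\mathbf{e}_1 + Y_2\mathbf{e}_2 + Y_3\mathbf{e}_3 + \cdots + Y_p\mathbf{e}_p \tag{8-37}$$

where $Y_i = (\mathbf{X} - \boldsymbol\mu)'\mathbf{e}_i$ is the population $i$th principal
component centered to have mean 0. The approximation to $\mathbf{X} - \boldsymbol\mu$ by
the first two principal components has the form $Y_1\mathbf{e}_1 + Y_2\mathbf{e}_2$.
This leaves an unexplained component of

$$\mathbf{X} - \boldsymbol\mu - Y_1\mathbf{e}_1 - Y_2\mathbf{e}_2$$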


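Let $\mathbf{E} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p]$ be the orthogonal
matrix whose columns are the eigenvectors of $\boldsymbol\Sigma$. The orthogonal
transformation of the unexplained part,

$$\mathbf{E}'(\mathbf{X} - \boldsymbol\mu - Y_1\mathbf{e}_1 - Y_2\mathbf{e}_2)
= \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_p \end{bmatrix}
- \begin{bmatrix} Y_1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
- \begin{bmatrix} 0 \\ Y_2 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ Y_3 \\ \vdots \\ Y_p \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ \mathbf{Y}_{(2)} \end{bmatrix}$$

so the last $p - 2$ principal components are obtained as an orthogonal transformation
of the approximation errors. Rather than base the $T^2$ chart on the approximation
errors, we can, equivalently, base it on these last principal components. Recall that

$$\operatorname{Var}(Y_i) = \lambda_i \quad\text{for } i = 1, 2, \ldots, p$$

and $\operatorname{Cov}(Y_i, Y_k) = 0$ for $i \neq k$. Consequently, the statistic
$\mathbf{Y}_{(2)}'\boldsymbol\Sigma^{-1}_{\mathbf{Y}_{(2)},\mathbf{Y}_{(2)}}\mathbf{Y}_{(2)}$,
based on the last $p - 2$ population principal components, becomes

$$\frac{Y_3^2}{\lambda_3} + \frac{Y_4^2}{\lambda_4} + \cdots + \frac{Y_p^2}{\lambda_p} \tag{8-38}$$

This is just the sum of the squares of $p - 2$ independent standard normal variables,
$\lambda_k^{-1/2}Y_k$, and so has a chi-square distribution with $p - 2$ degrees of freedom.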

In terms of the sample data, the principal components and eigenvalues must be
estimated. Because the coefficients of the linear combinations $\hat{\mathbf{e}}_i$ are
also estimates, the principal components do not have a normal distribution even when the
population is normal. However, it is customary to create a $T^2$-chart based on the statistic

$$T_j^2 = \frac{\hat y_{j3}^2}{\hat\lambda_3} + \frac{\hat y_{j4}^2}{\hat\lambda_4} + \cdots + \frac{\hat y_{jp}^2}{\hat\lambda_p}$$

which involves the estimated eigenvalues and vectors. Further, it is usual to appeal
to the large sample approximation described by (8-38) and set the upper control
limit of the $T^2$-chart as $\text{UCL} = c^2 = \chi^2_{p-2}(\alpha)$.

This $T^2$-statistic is based on high-dimensional data. For example, when $p = 20$
variables are measured, it uses the information in the 18-dimensional space
perpendicular to the first two eigenvectors $\hat{\mathbf{e}}_1$ and $\hat{\mathbf{e}}_2$.
Still, this $T^2$ based on the unexplained variation in the original observations is
reported as highly effective in picking up special causes of variation.




Example 8.11 (A $T^2$-chart for the unexplained [orthogonal] overtime hours)
Consider the quality control analysis of the police department overtime hours in
Example 8.10. The first part of the quality monitoring procedure, the quality ellipse
based on the first two principal components, was shown in Figure 8.7. To illustrate
the second step of the two-step monitoring procedure, we create the chart for the
other principal components.

Since $p = 5$, this chart is based on $5 - 2 = 3$ dimensions, and the upper control
limit is $\chi^2_3(.05) = 7.81$. Using the eigenvalues and the values of the principal
components, given in Example 8.10, we plot the time sequence of values

$$T_j^2 = \frac{\hat y_{j3}^2}{\hat\lambda_3} + \frac{\hat y_{j4}^2}{\hat\lambda_4} + \frac{\hat y_{j5}^2}{\hat\lambda_5}$$

where the first value is $T_1^2 = .891$ and so on. The $T^2$-chart is shown in Figure 8.8.


Figure 8.8 A $T^2$-chart based on the last three principal components of overtime hours.


Since points 12 and 13 exceed or are near the upper control limit, something has
happened during these periods. We note that they are just beyond the period in
which the extraordinary event overtime hours peaked.

From Table 8.2, $\hat y_{j3}$ is large in period 12, and from Table 8.1, the large
coefficients in $\hat{\mathbf{e}}_3$ belong to legal appearances, holdover, and COA hours.
Was there some adjusting of these other categories following the period extraordinary
hours peaked? ■
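The values plotted in the $T^2$-chart can be recomputed from the last three columns of Table 8.2 and the three smallest eigenvalues in Table 8.1. The following is a minimal sketch with illustrative variable names; the control limit 7.81 is the value quoted in the example.

```python
# Three smallest eigenvalues of S (Table 8.1).
lambdas = (628_129, 221_138, 99_824)

# Third, fourth, and fifth principal component values for periods 1..16 (Table 8.2).
last_components = [
    (425.8, -189.1, -209.8), (883.6, -565.9, -441.5), (707.5, 736.3, 38.2),
    (-184.0, 443.7, -325.3), (115.7, 296.4, 437.5), (281.2, 620.5, 142.7),
    (340.6, -135.5, 521.2), (-1437.3, -148.8, 61.6), (125.3, 68.2, 611.5),
    (7.8, 169.4, -202.3), (-0.9, 276.2, -159.6), (-2153.6, -418.8, 28.2),
    (966.5, -1142.3, 182.6), (-165.5, 270.6, -344.9), (-82.9, -196.8, -89.9),
    (170.1, -84.1, -250.2),
]

ucl = 7.81  # chi-square upper 5% point with 3 d.f., from the text

# T_j^2 = sum over the last three components of y_jk^2 / lambda_k.
t2 = [sum(y ** 2 / lam for y, lam in zip(row, lambdas)) for row in last_components]

above_ucl = [period for period, value in enumerate(t2, start=1) if value > ucl]
print(round(t2[0], 3), above_ucl, round(t2[12], 2))
```

Period 12 exceeds the limit and period 13 falls just below it, which matches the description of Figure 8.8.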


Controlling Future Values 


Previously, we considered checking whether a given series of multivariate observa- 
tions was stable by considering separately the first two principal components and 
then the last p — 2. Because the chi-square distribution was used to approximate 
the UCL of the T?-chart and the critical distance for the ellipse format chart, no fur- 
ther modifications are necessary for monitoring future values. 




Example 8.12 (Control ellipse for future principal components) In Example 8.10, we
determined that case 11 was out of control. We drop this point and recalculate the
eigenvalues and eigenvectors based on the covariance of the remaining 15 observations.
The results are shown in Table 8.3.


Table 8.3 Eigenvectors and Eigenvalues from the 15 Stable Observations

Appearances overtime (x1)
Extraordinary event (x2)
Holdover hours (x3)
COA hours (x4)
Meeting hours (x5)

$\hat\lambda_i$   2,964,749.9   672,995.1   396,596.5   194,401.0   92,


The principal components have changed. The component consisting primarily of 
extraordinary event overtime is now the third principal component and is not includ- 
ed in the chart of the first two. Because our initial sample size is only 16, dropping a 
single case can make a substantial difference. Usually, at least 50 or more observa- 
tions are needed, from stable operation of the process, in order to set future limits. 

Figure 8.9 gives the 99% prediction (8-36) ellipse for future pairs of values for
the new first two principal components of overtime. The 15 stable pairs of principal
components are also shown. ■


Figure 8.9 A 99% ellipse format chart for the first two principal components of
future values of overtime.




In some applications of multivariate control in the chemical and pharmaceutical 
industries, more than 100 variables are monitored simultaneously. These include nu- 
merous process variables as well as quality variables. Typically, the space orthogonal 
to the first few principal components has a dimension greater than 100 and some of 
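the eigenvalues are very small. An alternative approach (see [13]) to constructing a
control chart, one that avoids the difficulty caused by dividing a small squared
principal component by a very small eigenvalue, has been successfully applied. To
implement this approach, we proceed as follows.

For each stable observation, take the sum of squares of its unexplained component

$$d_{Uj}^2 = (\mathbf{x}_j - \bar{\mathbf{x}} - \hat y_{j1}\hat{\mathbf{e}}_1 - \hat y_{j2}\hat{\mathbf{e}}_2)'(\mathbf{x}_j - \bar{\mathbf{x}} - \hat y_{j1}\hat{\mathbf{e}}_1 - \hat y_{j2}\hat{\mathbf{e}}_2)$$

Note that, by inserting $\hat{\mathbf{E}}\hat{\mathbf{E}}' = \mathbf{I}$, we also have

$$d_{Uj}^2 = (\mathbf{x}_j - \bar{\mathbf{x}} - \hat y_{j1}\hat{\mathbf{e}}_1 - \hat y_{j2}\hat{\mathbf{e}}_2)'\hat{\mathbf{E}}\hat{\mathbf{E}}'(\mathbf{x}_j - \bar{\mathbf{x}} - \hat y_{j1}\hat{\mathbf{e}}_1 - \hat y_{j2}\hat{\mathbf{e}}_2) = \sum_{k=3}^{p} \hat y_{jk}^2$$

which is just the sum of squares of the neglected principal components.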

Using either form, the $d_{Uj}^2$ are plotted versus $j$ to create a control chart. The
lower limit of the chart is 0 and the upper limit is set by approximating the
distribution of $d_{Uj}^2$ as the distribution of a constant $c$ times a chi-square
random variable with $\nu$ degrees of freedom.

For the chi-square approximation, the constant $c$ and degrees of freedom $\nu$ are
chosen to match the sample mean and variance of the $d_{Uj}^2$, $j = 1, 2, \ldots, n$.
In particular, we set

$$\overline{d_U^2} = \frac{1}{n}\sum_{j=1}^{n} d_{Uj}^2 = c\,\nu$$

$$s_{d^2}^2 = \frac{1}{n - 1}\sum_{j=1}^{n} \left(d_{Uj}^2 - \overline{d_U^2}\right)^2 = 2c^2\nu$$

and determine

$$c = \frac{s_{d^2}^2}{2\,\overline{d_U^2}} \qquad\text{and}\qquad \nu = 2\,\frac{\left(\overline{d_U^2}\right)^2}{s_{d^2}^2}$$

The upper control limit is then $c\,\chi^2_\nu(\alpha)$, where $\alpha = .05$ or $.01$.
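The moment matching above amounts to solving $c\nu = \overline{d_U^2}$ and $2c^2\nu = s_{d^2}^2$ for $c$ and $\nu$. A minimal sketch follows; the function name and the $d_{Uj}^2$ values are hypothetical and serve only to illustrate the calculation.

```python
from statistics import mean, variance


def scaled_chi2_parameters(d2):
    """Match a c * chi-square(nu) distribution to the mean and variance of d2."""
    d_bar = mean(d2)   # should equal c * nu
    s2 = variance(d2)  # sample variance (divisor n - 1), should equal 2 * c**2 * nu
    c = s2 / (2 * d_bar)
    nu = 2 * d_bar ** 2 / s2
    return c, nu


# Hypothetical unexplained sums of squares from stable observations.
d2_values = [4.1, 2.7, 6.3, 3.8, 5.0, 2.2, 7.4, 3.1]
c, nu = scaled_chi2_parameters(d2_values)
print(round(c, 3), round(nu, 2))
```

The fitted $\nu$ is generally not an integer, so the percentile $\chi^2_\nu(\alpha)$ needed for the upper control limit has to come from a routine that accepts fractional degrees of freedom (for instance `scipy.stats.chi2.ppf`) or from interpolation in a table.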


Supplement 8A


THE GEOMETRY OF THE SAMPLE PRINCIPAL COMPONENT APPROXIMATION


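In this supplement, we shall present interpretations for approximations to the data
based on the first $r$ sample principal components. The interpretations of both the
$p$-dimensional scatter plot and the $n$-dimensional representation rely on the
algebraic result that follows. We consider approximations of the form
$\mathbf{A}_{(n \times p)} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n]'$
to the mean corrected data matrix

$$[\mathbf{x}_1 - \bar{\mathbf{x}},\ \mathbf{x}_2 - \bar{\mathbf{x}},\ \ldots,\ \mathbf{x}_n - \bar{\mathbf{x}}]'$$

The error of approximation is quantified as the sum of the $np$ squared errors

$$\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{a}_j)'(\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{a}_j) = \sum_{j=1}^{n}\sum_{i=1}^{p} (x_{ji} - \bar x_i - a_{ji})^2 \tag{8A-1}$$


Result 8A.1 Let $\mathbf{A}_{(n \times p)}$ be any matrix with
$\operatorname{rank}(\mathbf{A}) \le r < \min(p, n)$. Let
$\hat{\mathbf{E}}_r = [\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_r]$,
where $\hat{\mathbf{e}}_i$ is the $i$th eigenvector of $\mathbf{S}$. The error of
approximation sum of squares in (8A-1) is minimized by the choice

$$\hat{\mathbf{A}} = \begin{bmatrix} (\mathbf{x}_1 - \bar{\mathbf{x}})' \\ (\mathbf{x}_2 - \bar{\mathbf{x}})' \\ \vdots \\ (\mathbf{x}_n - \bar{\mathbf{x}})' \end{bmatrix}\hat{\mathbf{E}}_r\hat{\mathbf{E}}_r' = [\hat{\mathbf{y}}_1, \hat{\mathbf{y}}_2, \ldots, \hat{\mathbf{y}}_r]\,\hat{\mathbf{E}}_r'$$

so the $j$th column of its transpose $\hat{\mathbf{A}}'$ is

$$\hat{\mathbf{a}}_j = \hat y_{j1}\hat{\mathbf{e}}_1 + \hat y_{j2}\hat{\mathbf{e}}_2 + \cdots + \hat y_{jr}\hat{\mathbf{e}}_r$$

where

$$[\hat y_{j1}, \hat y_{j2}, \ldots, \hat y_{jr}]' = [\hat{\mathbf{e}}_1'(\mathbf{x}_j - \bar{\mathbf{x}}),\ \hat{\mathbf{e}}_2'(\mathbf{x}_j - \bar{\mathbf{x}}),\ \ldots,\ \hat{\mathbf{e}}_r'(\mathbf{x}_j - \bar{\mathbf{x}})]'$$

are the values of the first $r$ sample principal components for the $j$th unit. Moreover,

$$\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}} - \hat{\mathbf{a}}_j)'(\mathbf{x}_j - \bar{\mathbf{x}} - \hat{\mathbf{a}}_j) = (n - 1)(\hat\lambda_{r+1} + \cdots + \hat\lambda_p)$$

where $\hat\lambda_{r+1} \ge \cdots \ge \hat\lambda_p$ are the smallest eigenvalues of $\mathbf{S}$.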


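Proof. Consider first any $\mathbf{A}$ whose transpose $\mathbf{A}'$ has columns
$\mathbf{a}_j$ that are a linear combination of a fixed set of $r$ perpendicular vectors
$\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r$, so that
$\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r]$ satisfies
$\mathbf{U}'\mathbf{U} = \mathbf{I}$. For fixed $\mathbf{U}$, $\mathbf{x}_j - \bar{\mathbf{x}}$
is best approximated by its projection on the space spanned by
$\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r$ (see Result 2A.3), or

$$(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{u}_1\mathbf{u}_1 + (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{u}_2\mathbf{u}_2 + \cdots + (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{u}_r\mathbf{u}_r
= [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r]\begin{bmatrix} \mathbf{u}_1'(\mathbf{x}_j - \bar{\mathbf{x}}) \\ \mathbf{u}_2'(\mathbf{x}_j - \bar{\mathbf{x}}) \\ \vdots \\ \mathbf{u}_r'(\mathbf{x}_j - \bar{\mathbf{x}}) \end{bmatrix}
= \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) \tag{8A-2}$$

This follows because, for an arbitrary vector $\mathbf{b}_j$,

$$\begin{aligned}
\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j &= \mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) + \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) - \mathbf{U}\mathbf{b}_j \\
&= (\mathbf{I} - \mathbf{U}\mathbf{U}')(\mathbf{x}_j - \bar{\mathbf{x}}) + \mathbf{U}(\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) - \mathbf{b}_j)
\end{aligned}$$

so the error sum of squares is

$$(\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j)'(\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j) = (\mathbf{x}_j - \bar{\mathbf{x}})'(\mathbf{I} - \mathbf{U}\mathbf{U}')(\mathbf{x}_j - \bar{\mathbf{x}}) + 0 + (\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) - \mathbf{b}_j)'(\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) - \mathbf{b}_j)$$

where the cross product vanishes because
$(\mathbf{I} - \mathbf{U}\mathbf{U}')\mathbf{U} = \mathbf{U} - \mathbf{U}\mathbf{U}'\mathbf{U} = \mathbf{U} - \mathbf{U} = \mathbf{0}$.
The last term is positive unless $\mathbf{b}_j$ is chosen so that
$\mathbf{b}_j = \mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})$ and
$\mathbf{U}\mathbf{b}_j = \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})$ is the
projection of $\mathbf{x}_j - \bar{\mathbf{x}}$ on the plane.

Further, with the choice $\mathbf{a}_j = \mathbf{U}\mathbf{b}_j = \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})$,
(8A-1) becomes

$$\begin{aligned}
\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}))'&(\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})) \\
&= \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})'(\mathbf{I} - \mathbf{U}\mathbf{U}')(\mathbf{x}_j - \bar{\mathbf{x}}) \\
&= \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})'(\mathbf{x}_j - \bar{\mathbf{x}}) - \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})
\end{aligned} \tag{8A-3}$$

We are now in a position to minimize the error over choices of $\mathbf{U}$ by
maximizing the last term in (8A-3). By the properties of trace (see Result 2A.12),

$$\begin{aligned}
\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) &= \sum_{j=1}^{n} \operatorname{tr}[(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})] \\
&= \sum_{j=1}^{n} \operatorname{tr}[\mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'] \\
&= (n - 1)\operatorname{tr}[\mathbf{U}\mathbf{U}'\mathbf{S}] = (n - 1)\operatorname{tr}[\mathbf{U}'\mathbf{S}\mathbf{U}]
\end{aligned} \tag{8A-4}$$

That is, the best choice for $\mathbf{U}$ maximizes the sum of the diagonal elements of
$\mathbf{U}'\mathbf{S}\mathbf{U}$. From (8-19), selecting $\mathbf{u}_1$ to maximize
$\mathbf{u}_1'\mathbf{S}\mathbf{u}_1$, the first diagonal element of
$\mathbf{U}'\mathbf{S}\mathbf{U}$, gives $\mathbf{u}_1 = \hat{\mathbf{e}}_1$. For
$\mathbf{u}_2$ perpendicular to $\hat{\mathbf{e}}_1$, $\mathbf{u}_2'\mathbf{S}\mathbf{u}_2$
is maximized by $\hat{\mathbf{e}}_2$. [See (2-52).] Continuing, we find that
$\hat{\mathbf{U}} = [\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_r] = \hat{\mathbf{E}}_r$
and $\hat{\mathbf{A}}' = \hat{\mathbf{E}}_r\hat{\mathbf{E}}_r'[\mathbf{x}_1 - \bar{\mathbf{x}}, \mathbf{x}_2 - \bar{\mathbf{x}}, \ldots, \mathbf{x}_n - \bar{\mathbf{x}}]$,
as asserted.

With this choice the $i$th diagonal element of $\hat{\mathbf{U}}'\mathbf{S}\hat{\mathbf{U}}$
is $\hat{\mathbf{e}}_i'\mathbf{S}\hat{\mathbf{e}}_i = \hat{\mathbf{e}}_i'(\hat\lambda_i\hat{\mathbf{e}}_i) = \hat\lambda_i$,
so $\operatorname{tr}[\hat{\mathbf{U}}'\mathbf{S}\hat{\mathbf{U}}] = \hat\lambda_1 + \hat\lambda_2 + \cdots + \hat\lambda_r$.
Also,

$$\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})'(\mathbf{x}_j - \bar{\mathbf{x}}) = \operatorname{tr}\left[\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right] = (n - 1)\operatorname{tr}(\mathbf{S}) = (n - 1)(\hat\lambda_1 + \hat\lambda_2 + \cdots + \hat\lambda_p)$$

Let $\mathbf{U} = \hat{\mathbf{U}}$ in (8A-3), and the error bound follows. ■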
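Result 8A.1 can be checked numerically in the simplest case $p = 2$, $r = 1$, where the eigenvalues of $\mathbf{S}$ have a closed form and no matrix library is needed. The data below are hypothetical and the variable names are illustrative; the sketch assumes $s_{12} \neq 0$ so that the eigenvector formula applies.

```python
from math import sqrt

# Hypothetical bivariate observations.
data = [(2.0, 1.0), (4.0, 3.0), (6.0, 4.0), (8.0, 8.0), (10.0, 9.0)]
n = len(data)
xbar = [sum(row[i] for row in data) / n for i in range(2)]
dev = [(x1 - xbar[0], x2 - xbar[1]) for x1, x2 in data]

# Elements of the sample covariance matrix S (divisor n - 1).
s11 = sum(d[0] * d[0] for d in dev) / (n - 1)
s22 = sum(d[1] * d[1] for d in dev) / (n - 1)
s12 = sum(d[0] * d[1] for d in dev) / (n - 1)

# Closed-form eigenvalues of a symmetric 2 x 2 matrix.
root = sqrt((s11 - s22) ** 2 + 4 * s12 ** 2)
lam1 = (s11 + s22 + root) / 2
lam2 = (s11 + s22 - root) / 2

# Unit eigenvector for lam1; (s12, lam1 - s11) solves (S - lam1 I) e = 0 when s12 != 0.
norm = sqrt(s12 ** 2 + (lam1 - s11) ** 2)
e1 = (s12 / norm, (lam1 - s11) / norm)

# Rank-one approximation a_j = y_j1 * e1 and its total squared error.
error_ss = 0.0
for d in dev:
    y1 = d[0] * e1[0] + d[1] * e1[1]  # first principal component value
    resid = (d[0] - y1 * e1[0], d[1] - y1 * e1[1])
    error_ss += resid[0] ** 2 + resid[1] ** 2

# Result 8A.1 says error_ss equals (n - 1) * lam2 here.
print(error_ss, (n - 1) * lam2)
```

The two printed numbers agree up to rounding, as the error bound in the result asserts for $r = 1$ and $p = 2$.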


The p-Dimensional Geometrical Interpretation 


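The geometrical interpretations involve the determination of best approximating
planes to the $p$-dimensional scatter plot. The plane through the origin, determined
by $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r$, consists of all points $\mathbf{x}$ with

$$\mathbf{x} = b_1\mathbf{u}_1 + b_2\mathbf{u}_2 + \cdots + b_r\mathbf{u}_r = \mathbf{U}\mathbf{b}, \quad\text{for some } \mathbf{b}$$

This plane, translated to pass through $\mathbf{a}$, becomes $\mathbf{a} + \mathbf{U}\mathbf{b}$
for some $\mathbf{b}$.

We want to select the $r$-dimensional plane $\mathbf{a} + \mathbf{U}\mathbf{b}$ that
minimizes the sum of squared distances $\sum_{j=1}^{n} d_j^2$ between the observations
$\mathbf{x}_j$ and the plane. If $\mathbf{x}_j$ is approximated by
$\mathbf{a} + \mathbf{U}\mathbf{b}_j$ with $\sum_{j=1}^{n} \mathbf{b}_j = \mathbf{0}$,⁵ then

$$\begin{aligned}
\sum_{j=1}^{n} (\mathbf{x}_j - \mathbf{a} - \mathbf{U}\mathbf{b}_j)'(\mathbf{x}_j - \mathbf{a} - \mathbf{U}\mathbf{b}_j)
&= \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j + \bar{\mathbf{x}} - \mathbf{a})'(\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j + \bar{\mathbf{x}} - \mathbf{a}) \\
&= \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j)'(\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j) + n(\bar{\mathbf{x}} - \mathbf{a})'(\bar{\mathbf{x}} - \mathbf{a}) \\
&\ge \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}} - \hat{\mathbf{E}}_r\hat{\mathbf{E}}_r'(\mathbf{x}_j - \bar{\mathbf{x}}))'(\mathbf{x}_j - \bar{\mathbf{x}} - \hat{\mathbf{E}}_r\hat{\mathbf{E}}_r'(\mathbf{x}_j - \bar{\mathbf{x}}))
\end{aligned}$$

by Result 8A.1, since $[\mathbf{U}\mathbf{b}_1, \ldots, \mathbf{U}\mathbf{b}_n] = \mathbf{A}'$
has $\operatorname{rank}(\mathbf{A}) \le r$. The lower bound is reached by taking
$\mathbf{a} = \bar{\mathbf{x}}$, so the plane passes through the sample mean. This plane is
determined by $\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_r$. The
coefficients of $\hat{\mathbf{e}}_k$ are
$\hat{\mathbf{e}}_k'(\mathbf{x}_j - \bar{\mathbf{x}}) = \hat y_{jk}$, the $k$th
sample principal component evaluated at the $j$th observation.

The approximating plane interpretation of sample principal components is
illustrated in Figure 8.10.

An alternative interpretation can be given. The investigator places a plane
through $\bar{\mathbf{x}}$ and moves it about to obtain the largest spread among the
shadows of the observations. From (8A-2), the projection of the deviation
$\mathbf{x}_j - \bar{\mathbf{x}}$ on the plane $\mathbf{U}\mathbf{b}$ is
$\mathbf{v}_j = \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})$. Now,
$\bar{\mathbf{v}} = \mathbf{0}$ and the sum of the squared lengths of the projection
deviations

$$\sum_{j=1}^{n} \mathbf{v}_j'\mathbf{v}_j = \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) = (n - 1)\operatorname{tr}[\mathbf{U}'\mathbf{S}\mathbf{U}]$$

is maximized by $\mathbf{U} = \hat{\mathbf{E}}$. Also, since $\bar{\mathbf{v}} = \mathbf{0}$,

$$(n - 1)\mathbf{S}_v = \sum_{j=1}^{n} (\mathbf{v}_j - \bar{\mathbf{v}})(\mathbf{v}_j - \bar{\mathbf{v}})' = \sum_{j=1}^{n} \mathbf{v}_j\mathbf{v}_j'$$

and this plane also maximizes the total variance

$$\operatorname{tr}(\mathbf{S}_v) = \frac{1}{n - 1}\operatorname{tr}\left[\sum_{j=1}^{n} \mathbf{v}_j\mathbf{v}_j'\right] = \frac{1}{n - 1}\operatorname{tr}\left[\sum_{j=1}^{n} \mathbf{v}_j'\mathbf{v}_j\right]$$


Figure 8.10 The $r = 2$-dimensional plane that approximates the scatter
plot by minimizing $\sum_{j=1}^{n} d_j^2$.


⁵If $\sum_{j=1}^{n} \mathbf{b}_j = n\bar{\mathbf{b}} \neq \mathbf{0}$, use
$\mathbf{a} + \mathbf{U}\mathbf{b}_j = (\mathbf{a} + \mathbf{U}\bar{\mathbf{b}}) + \mathbf{U}(\mathbf{b}_j - \bar{\mathbf{b}}) = \mathbf{a}^* + \mathbf{U}\mathbf{b}_j^*$.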


The n-Dimensional Geometrical Interpretation 


Let us now consider, by columns, the approximation of the mean-centered data
matrix by $\mathbf{A}$. For $r = 1$, the $i$th column
$[x_{1i} - \bar x_i, x_{2i} - \bar x_i, \ldots, x_{ni} - \bar x_i]'$ is approximated by a
multiple $c_i\mathbf{b}$ of a fixed vector $\mathbf{b}' = [b_1, b_2, \ldots, b_n]$. The
square of the length of the error of approximation is

$$L_i^2 = \sum_{j=1}^{n} (x_{ji} - \bar x_i - c_i b_j)^2$$

Considering $\mathbf{A}_{(n \times p)}$ to be of rank one, we conclude from Result 8A.1 that

$$\hat{\mathbf{A}} = \begin{bmatrix} \hat{\mathbf{e}}_1'(\mathbf{x}_1 - \bar{\mathbf{x}})\,\hat{\mathbf{e}}_1' \\ \hat{\mathbf{e}}_1'(\mathbf{x}_2 - \bar{\mathbf{x}})\,\hat{\mathbf{e}}_1' \\ \vdots \\ \hat{\mathbf{e}}_1'(\mathbf{x}_n - \bar{\mathbf{x}})\,\hat{\mathbf{e}}_1' \end{bmatrix} = \begin{bmatrix} \hat y_{11} \\ \hat y_{21} \\ \vdots \\ \hat y_{n1} \end{bmatrix}\hat{\mathbf{e}}_1'$$

minimizes the sum of squared lengths $\sum_{i=1}^{p} L_i^2$. That is, the best direction
is determined by the vector of values of the first principal component. This is
illustrated in Figure 8.11(a). Note that the longer deviation vectors (the larger
$s_{ii}$'s) have the most influence on the minimization of $\sum_{i=1}^{p} L_i^2$.

If the variables are first standardized, the resulting vector
$[(x_{1i} - \bar x_i)/\sqrt{s_{ii}}, (x_{2i} - \bar x_i)/\sqrt{s_{ii}}, \ldots, (x_{ni} - \bar x_i)/\sqrt{s_{ii}}]$
has squared length $n - 1$ for all variables, and each vector exerts equal influence on
the choice of direction. [See Figure 8.11(b).]

In either case, the vector $\mathbf{b}$ is moved around in $n$-space to minimize the sum
of the squares of the distances $\sum_{i=1}^{p} L_i^2$. In the former case $L_i^2$ is the
squared distance between $[x_{1i} - \bar x_i, x_{2i} - \bar x_i, \ldots, x_{ni} - \bar x_i]'$
and its projection on the line determined by $\mathbf{b}$. The second principal component
minimizes the same quantity among all vectors perpendicular to the first choice.


Figure 8.11 The first sample principal component, $\hat y_1$, minimizes the
sum of the squares of the distances, $L_i^2$, from the deviation vectors,
$\mathbf{d}_i' = [x_{1i} - \bar x_i, x_{2i} - \bar x_i, \ldots, x_{ni} - \bar x_i]$, to a line.
Panels: (a) Principal component of $\mathbf{S}$; (b) Principal component of $\mathbf{R}$.


Exercises 


8.1. Determine the population principal components $Y_1$ and $Y_2$ for the covariance matrix

$$\boldsymbol\Sigma = \begin{bmatrix} 5 & 2 \\ 2 & 2 \end{bmatrix}$$

Also, calculate the proportion of the total population variance explained by the first
principal component.


8.2. Convert the covariance matrix in Exercise 8.1 to a correlation matrix $\boldsymbol\rho$.

(a) Determine the principal components $Y_1$ and $Y_2$ from $\boldsymbol\rho$ and compute
the proportion of total population variance explained by $Y_1$.

(b) Compare the components calculated in Part a with those obtained in Exercise 8.1.
Are they the same? Should they be?

(c) Compute the correlations $\rho_{Y_1,Z_1}$, $\rho_{Y_1,Z_2}$, and $\rho_{Y_2,Z_1}$.


8.3. Let

$$\boldsymbol\Sigma = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix}$$

Determine the principal components $Y_1$, $Y_2$, and $Y_3$. What can you say about the
eigenvectors (and principal components) associated with eigenvalues that are not distinct?


8.4. Find the principal components and the proportion of the total population variance
explained by each when the covariance matrix is

$$\boldsymbol\Sigma = \begin{bmatrix} \sigma^2 & \sigma^2\rho & 0 \\ \sigma^2\rho & \sigma^2 & \sigma^2\rho \\ 0 & \sigma^2\rho & \sigma^2 \end{bmatrix}, \qquad -\frac{1}{\sqrt{2}} < \rho < \frac{1}{\sqrt{2}}$$


8.5. (a) Find the eigenvalues of the correlation matrix

$$\boldsymbol\rho = \begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{bmatrix}$$

Are your results consistent with (8-16) and (8-17)?

(b) Verify the eigenvalue-eigenvector pairs for the $p \times p$ matrix $\boldsymbol\rho$
given in (8-15).


8.6. Data on $x_1$ = sales and $x_2$ = profits for the 10 largest companies in the world
were listed in Exercise 1.4 of Chapter 1. From Example 4.12

$$\bar{\mathbf{x}} = \begin{bmatrix} 155.60 \\ 14.70 \end{bmatrix}, \qquad \mathbf{S} = \begin{bmatrix} 7476.45 & 303.62 \\ 303.62 & 26.19 \end{bmatrix}$$

(a) Determine the sample principal components and their variances for these data. (You
may need the quadratic formula to solve for the eigenvalues of $\mathbf{S}$.)

(b) Find the proportion of the total sample variance explained by $\hat y_1$.

(c) Sketch the constant density ellipse
$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) = 1.4$, and
indicate the principal components $\hat y_1$ and $\hat y_2$ on your graph.

(d) Compute the correlation coefficients $r_{\hat y_1, x_k}$, $k = 1, 2$. What
interpretation, if any, can you give to the first principal component?


8.7. Convert the covariance matrix $\mathbf{S}$ in Exercise 8.6 to a sample correlation
matrix $\mathbf{R}$.

(a) Find the sample principal components $\hat y_1$, $\hat y_2$ and their variances.

(b) Compute the proportion of the total sample variance explained by $\hat y_1$.

(c) Compute the correlation coefficients $r_{\hat y_1, z_k}$, $k = 1, 2$. Interpret $\hat y_1$.

(d) Compare the components obtained in Part a with those obtained in Exercise 8.6(a).
Given the original data displayed in Exercise 1.4, do you feel that it is better to
determine principal components from the sample covariance matrix or sample
correlation matrix? Explain.




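8.8. Use the results in Example 8.5.

(a) Compute the correlations $r_{\hat{y}_i, z_k}$ for $i = 1, 2$ and $k = 1, 2, \ldots, 5$. Do these correlations reinforce the interpretations given to the first two components? Explain.

(b) Test the hypothesis

$$
H_0\colon \boldsymbol{\rho} = \boldsymbol{\rho}_0 =
\begin{bmatrix}
1 & \rho & \rho & \rho & \rho \\
\rho & 1 & \rho & \rho & \rho \\
\rho & \rho & 1 & \rho & \rho \\
\rho & \rho & \rho & 1 & \rho \\
\rho & \rho & \rho & \rho & 1
\end{bmatrix}
$$

versus

$$
H_1\colon \boldsymbol{\rho} \neq \boldsymbol{\rho}_0
$$

at the 5% level of significance. List any assumptions required in carrying out this test.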


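8.9. (A test that all variables are independent.)

(a) Consider the normal theory likelihood ratio test of $H_0$: $\boldsymbol{\Sigma}$ is the diagonal matrix

$$
\boldsymbol{\Sigma}_0 =
\begin{bmatrix}
\sigma_{11} & 0 & \cdots & 0 \\
0 & \sigma_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_{pp}
\end{bmatrix},
\qquad \sigma_{ii} > 0
$$

Show that the test is as follows: Reject $H_0$ if

$$
\Lambda = \frac{|\mathbf{S}|^{n/2}}{\prod_{i=1}^{p} s_{ii}^{n/2}} = |\mathbf{R}|^{n/2} < c
$$

For a large sample size, $-2 \ln \Lambda$ is approximately $\chi^2_{p(p-1)/2}$. Bartlett [3] suggests that the test statistic $-2[1 - (2p + 11)/6n] \ln \Lambda$ be used in place of $-2 \ln \Lambda$. This results in an improved chi-square approximation. The large sample $\alpha$ critical point is $\chi^2_{p(p-1)/2}(\alpha)$. Note that testing $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_0$ is the same as testing $\boldsymbol{\rho} = \mathbf{I}$.

(b) Show that the likelihood ratio test of $H_0$: $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$ rejects $H_0$ if

$$
\Lambda = \frac{|\mathbf{S}|^{n/2}}{(\operatorname{tr}(\mathbf{S})/p)^{np/2}}
= \left[ \frac{\prod_{i=1}^{p} \hat{\lambda}_i}{\left( \frac{1}{p} \sum_{i=1}^{p} \hat{\lambda}_i \right)^{p}} \right]^{n/2}
= \left[ \frac{\text{geometric mean } \hat{\lambda}_i}{\text{arithmetic mean } \hat{\lambda}_i} \right]^{np/2} < c
$$

For a large sample size, Bartlett [3] suggests that

$$
-2[1 - (2p^2 + p + 2)/6pn] \ln \Lambda
$$

is approximately $\chi^2_{(p+2)(p-1)/2}$. Thus, the large sample $\alpha$ critical point is $\chi^2_{(p+2)(p-1)/2}(\alpha)$. This test is called a sphericity test, because the constant density contours are spheres when $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$.

Hint:

(a) $\max_{\boldsymbol{\mu}, \boldsymbol{\Sigma}} L(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is given by (5-10), and $\max L(\boldsymbol{\mu}, \boldsymbol{\Sigma}_0)$ is the product of the univariate likelihoods,

$$
\max_{\mu_i, \sigma_{ii}} \ (2\pi)^{-n/2} \sigma_{ii}^{-n/2} \exp\left[ -\sum_{j=1}^{n} (x_{ji} - \mu_i)^2 / 2\sigma_{ii} \right]
$$

Hence $\hat{\mu}_i = n^{-1} \sum_{j=1}^{n} x_{ji}$ and $\hat{\sigma}_{ii} = (1/n) \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2$. The divisor $n$ cancels in $\Lambda$, so $\mathbf{S}$ may be used.

(b) Verify $\hat{\sigma}^2 = \left[ \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2 + \cdots + \sum_{j=1}^{n} (x_{jp} - \bar{x}_p)^2 \right] / np$ under $H_0$. Again, the divisors $n$ cancel in the statistic, so $\mathbf{S}$ may be used. Use Result 5.2 to calculate the chi-square degrees of freedom.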
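
For readers who want to compute the two Bartlett-corrected statistics in Exercise 8.9, here is a minimal sketch. The function names and the example matrix `S_example` are hypothetical, not from the text, and the code assumes that `S` is a positive definite sample covariance matrix based on `n` observations. It returns each statistic with its chi-square degrees of freedom; the comparison with the critical point is left to the reader.

```python
import numpy as np


def independence_statistic(S, n):
    """Bartlett-corrected -2 ln(Lambda) for H0: Sigma diagonal, Lambda = |R|^(n/2)."""
    S = np.asarray(S, dtype=float)
    p = S.shape[0]
    d = np.sqrt(np.diag(S))
    R = S / np.outer(d, d)                     # sample correlation matrix
    _, logdet_R = np.linalg.slogdet(R)
    minus2_ln_lambda = -n * logdet_R           # -2 * (n / 2) * ln|R|
    stat = (1.0 - (2 * p + 11) / (6.0 * n)) * minus2_ln_lambda
    df = p * (p - 1) // 2
    return stat, df


def sphericity_statistic(S, n):
    """Bartlett-corrected -2 ln(Lambda) for H0: Sigma = sigma^2 I."""
    S = np.asarray(S, dtype=float)
    p = S.shape[0]
    lam = np.linalg.eigvalsh(S)
    # ln(Lambda) = (n p / 2) [ln(geometric mean) - ln(arithmetic mean)] of the eigenvalues
    ln_lambda = 0.5 * n * p * (np.mean(np.log(lam)) - np.log(np.mean(lam)))
    stat = -2.0 * (1.0 - (2 * p**2 + p + 2) / (6.0 * p * n)) * ln_lambda
    df = (p + 2) * (p - 1) // 2
    return stat, df


# Hypothetical 3 x 3 covariance matrix and sample size, for illustration only.
S_example = np.array([[4.0, 2.0, 0.0],
                      [2.0, 3.0, 1.0],
                      [0.0, 1.0, 2.0]])
print(independence_statistic(S_example, n=50))
print(sphericity_statistic(S_example, n=50))
```

Each statistic would then be compared with the upper $\alpha$ percentile of the chi-square distribution with the returned degrees of freedom, for instance via `scipy.stats.chi2.ppf(1 - alpha, df)` if SciPy is available.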


The following exercises require the use of a computer. 


8.10. The weekly rates of return for five stocks listed on the New York Stock Exchange are given in Table 8.4. (See the stock-price data on the following website: www.prenhall.com/statistics.)

(a) Construct the sample covariance matrix $\mathbf{S}$, and find the sample principal components in (8-20). (Note that the sample mean vector $\bar{\mathbf{x}}$ is displayed in Example 8.5.)

(b) Determine the proportion of the total sample variance explained by the first three principal components. Interpret these components.

(c) Construct Bonferroni simultaneous 90% confidence intervals for the variances $\lambda_1$, $\lambda_2$, and $\lambda_3$ of the first three population components $Y_1$, $Y_2$, and $Y_3$.

(d) Given the results in Parts a-c, do you feel that the stock rates-of-return data can be summarized in fewer than five dimensions? Explain.


Table 8.4 Stock-Price Data (Weekly Rate of Return)

Week   JP Morgan   Citibank   Wells Fargo   Royal Dutch Shell   Exxon Mobil

  1     0.01303    -0.00784    -0.00319        -0.04477           0.00522
  2     0.00849     0.01669    -0.00621         0.01196           0.01349
  3     0.01792    -0.00864     0.01004         0                -0.00614
  4     0.02156    -0.00349     0.01744        -0.02859          -0.00695
  5     0.01082     0.00372    -0.01013         0.02919           0.04098
  6     0.01017    -0.01220    -0.00838         0.01371           0.00299
  7     0.01113     0.02800     0.00807         0.03054           0.00323
  8     0.04848    -0.00515     0.01825         0.00633           0.00768
  9     0.03449    -0.01380    -0.00805        -0.02990          -0.01081
 10    -0.00466     0.02099    -0.00608        -0.02039          -0.01267
  ⋮
 94     0.03732     0.03593     0.02528         0.05819           0.01697
 95     0.02380     0.00311    -0.00688         0.01225           0.02817
 96     0.02568     0.05253     0.04070        -0.03166          -0.01885
 97     0.00606     0.00863     0.00584         0.04456           0.03059
 98     0.02174     0.02296     0.02920         0.00844           0.03193
 99     0.00337    -0.01531    -0.02382        -0.00167          -0.01723
100     0.00336     0.00290    -0.00305        -0.00122          -0.00970
101     0.01701     0.00951     0.01820        -0.01618          -0.00756
102     0.01039    -0.00266     0.00443        -0.00248          -0.01645
103    -0.01279    -0.01437    -0.01874        -0.00498          -0.01637




8.11. Consider the census-tract data listed in Table 8.5. Suppose the observations on $X_5$ = median value home were recorded in ten thousands, rather than hundred thousands, of dollars; that is, multiply all the numbers listed in the sixth column of the table by 10.

(a) Construct the sample covariance matrix $\mathbf{S}$ for the census-tract data when $X_5$ = median value home is recorded in ten thousands of dollars. (Note that this covariance matrix can be obtained from the covariance matrix given in Example 8.3 by multiplying the off-diagonal elements in the fifth column and row by 10 and the diagonal element $s_{55}$ by 100. Why?)

(b) Obtain the eigenvalue-eigenvector pairs and the first two sample principal components for the covariance matrix in Part a.

(c) Compute the proportion of total variance explained by the first two principal components obtained in Part b. Calculate the correlation coefficients, $r_{\hat{y}_i, x_k}$, and interpret these components if possible. Compare your results with the results in Example 8.3. What can you say about the effects of this change in scale on the principal components?


8.12. Consider the air-pollution data listed in Table 1.5. Your job is to summarize these data in fewer than $p = 7$ dimensions if possible. Conduct a principal component analysis of the data using both the covariance matrix $\mathbf{S}$ and the correlation matrix $\mathbf{R}$. What have you learned? Does it make any difference which matrix is chosen for analysis? Can the data be summarized in three or fewer dimensions? Can you interpret the principal components?


Table 8.5 Census-tract Data

         Total         Professional   Employed       Government    Median
         population    degree         age over 16    employment    home value
Tract    (thousands)   (percent)      (percent)      (percent)     ($100,000)

  1        2.67           5.71          69.02           30.3          1.48
  2        2.25           4.37          72.98           43.3          1.44
  3        3.12          10.27          64.94           32.0          2.11
  4        5.14           7.44          71.29           24.5          1.85
  5        5.54           9.25          74.94           31.0          2.23
  6        5.04           4.84          53.61           48.2          1.60
  7        3.14           4.82          67.00           37.6          1.52
  8        2.43           2.40          67.20           36.8          1.40
  9        5.38           4.30          83.03           19.7          2.07
 10        7.34           2.73          72.60           24.5          1.42
  ⋮
 52        7.25           1.16          78.52           23.6          1.50
 53        5.44           2.93          73.59           22.3          1.65
 54        5.83           4.47          77.33           26.2          2.16
 55        3.74           2.26          79.70           20.2          1.58
 56        9.21           2.36          74.58           21.8          1.72
 57        2.14           6.30          86.54           17.4          2.80
 58        6.62           4.79          78.84           20.0          2.33
 59        4.24           5.82          71.39           27.1          1.69
 60        4.72           4.71          78.01           20.6          1.55
 61        6.48           4.93          74.23           20.9          1.98

Note: Observations from adjacent census tracts are likely to be correlated. That is, these 61 observations may not constitute a random sample. Complete data set available at www.prenhall.com/statistics.


8.13. In the radiotherapy data listed in Table 1.7 (see also the radiotherapy data on the website www.prenhall.com/statistics), the $n = 98$ observations on $p = 6$ variables represent patients' reactions to radiotherapy.

(a) Obtain the covariance and correlation matrices $\mathbf{S}$ and $\mathbf{R}$ for these data.

(b) Pick one of the matrices $\mathbf{S}$ or $\mathbf{R}$ (justify your choice), and determine the eigenvalues and eigenvectors. Prepare a table showing, in decreasing order of size, the percent that each eigenvalue contributes to the total sample variance.

(c) Given the results in Part b, decide on the number of important sample principal components. Is it possible to summarize the radiotherapy data with a single reaction-index component? Explain.

(d) Prepare a table of the correlation coefficients between each principal component you decide to retain and the original variables. If possible, interpret the components.


8.14. Perform a principal component analysis using the sample covariance matrix of the sweat data given in Example 5.2. Construct a Q-Q plot for each of the important principal components. Are there any suspect observations? Explain.


8.15. The four sample standard deviations for the postbirth weights discussed in Example 8.6 are

$\sqrt{s_{11}} = 32.9909$, $\sqrt{s_{22}} = 33.5918$, $\sqrt{s_{33}} = 36.5534$, and $\sqrt{s_{44}} = 37.3517$

Use these and the correlations given in Example 8.6 to construct the sample covariance matrix $\mathbf{S}$. Perform a principal component analysis using $\mathbf{S}$.


8.16. Over a period of five years in the 1990s, yearly samples of fishermen on 28 lakes in Wisconsin were asked to report the time they spent fishing and how many of each type of game fish they caught. Their responses were then converted to a catch rate per hour for

$x_1$ = Bluegill          $x_2$ = Black crappie   $x_3$ = Smallmouth bass
$x_4$ = Largemouth bass   $x_5$ = Walleye         $x_6$ = Northern pike

The estimated correlation matrix (courtesy of Jodi Barnet)


1 4919 2636 4653 -—.2277 0652 


4919 1 3127) = =.3506 -—.1917 ~—.2045 

R= 2635 3127 ] 4108 0647. ~—.2493 
4653 3506 .4108 1 ~.2249 = 2293 

—.2277 ~.1917 .0647 -.2249 1 -.2144 

0652 = .2045) 2493, 2293 —.2144 1 


is based on a sample of about 120. (There were a few missing values.)

Fish caught by the same fisherman live alongside of each other, so the data should provide some evidence on how the fish group. The first four fish belong to the centrarchids, the most plentiful family. The walleye is the most popular fish to eat.

(a) Comment on the pattern of correlation within the centrarchid family $x_1$ through $x_4$. Does the walleye appear to group with the other fish?

(b) Perform a principal component analysis using only $x_1$ through $x_4$. Interpret your results.

(c) Perform a principal component analysis using all six variables. Interpret your results.




8.17. Using the data on bone mineral content in Table 1.8, perform a principal component analysis of $\mathbf{S}$.


8.18. The data on national track records for women are listed in Table 1.9. 


(a) Obtain the sample correlation matrix $\mathbf{R}$ for these data, and determine its eigenvalues and eigenvectors.

(b) Determine the first two principal components for the standardized variables. Prepare a table showing the correlations of the standardized variables with the components, and the cumulative percentage of the total (standardized) sample variance explained by the two components.

(c) Interpret the two principal components obtained in Part b. (Note that the first component is essentially a normalized unit vector and might measure the athletic excellence of a given nation. The second component might measure the relative strength of a nation at the various running distances.)

(d) Rank the nations based on their score on the first principal component. Does this ranking correspond with your intuitive notion of athletic excellence for the various countries?


8.19. Refer to Exercise 8.18. Convert the national track records for women in Table 1.9 to
speeds measured in meters per second. Notice that the records for 800 m, 1500 m, 
3000 m, and the marathon are given in minutes. The marathon is 26.2 miles, or 
42,195 meters, long. Perform a principal components analysis using the covariance 
matrix S of the speed data. Compare the results with the results in Exercise 8.18. Do
your interpretations of the components differ? If the nations are ranked on the basis of 
their score on the first principal component, does the subsequent ranking differ from 
that in Exercise 8.18? Which analysis do you prefer? Why? 


8.20. The data on national track records for men are listed in Table 8.6. (See also the data
on national track records for men on the website www.prenhall.com/statistics.) Repeat
the principal component analysis outlined in Exercise 8.18 for the men. Are the results 
consistent with those obtained from the women’s data? 


8.21. Refer to Exercise 8.20. Convert the national track records for men in Table 8.6 to speeds 
measured in meters per second. Notice that the records for 800 m, 1500 m, 5000 m, 
10,000 m and the marathon are given in minutes. The marathon is 26.2 miles, or 
42,195 meters, long. Perform a principal component analysis using the covariance matrix 
S of the speed data. Compare the results with the results in Exercise 8.20. Which analysis 
do you prefer? Why? 


8.22. Consider the data on bulls in Table 1.10. Utilizing the seven variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt, perform a principal component analysis using the covariance matrix $\mathbf{S}$ and the correlation matrix $\mathbf{R}$. Your analysis should include the following:

(a) Determine the appropriate number of components to effectively summarize the sample variability. Construct a scree plot to aid your determination.

(b) Interpret the sample principal components.

(c) Do you think it is possible to develop a "body size" or "body configuration" index from the data on the seven variables above? Explain.

(d) Using the values for the first two principal components, plot the data in a two-dimensional space with $\hat{y}_1$ along the vertical axis and $\hat{y}_2$ along the horizontal axis. Can you distinguish groups representing the three breeds of cattle? Are there any outliers?

(e) Construct a Q-Q plot using the first principal component. Interpret the plot.




Table 8.6 National Track Records for Men

                      100m    200m    400m    800m    1500m   5000m   10,000m   Marathon
Country               (s)     (s)     (s)     (min)   (min)   (min)   (min)     (min)

Argentina             10.23   20.37   46.18   1.77    3.68    13.33   27.65     129.57
Australia              9.93   20.06   44.38   1.74    3.53    12.93   27.53     127.51
Austria               10.15   20.45   45.80   1.77    3.58    13.26   27.72     132.22
Belgium               10.14   20.19   45.02   1.73    3.57    12.83   26.87     127.20
Bermuda               10.27   20.30   45.26   1.79    3.70    14.64   30.49     146.37
Brazil                10.00   19.89   44.29   1.70    3.57    13.48   28.13     126.05
Canada                 9.84   20.17   44.72   1.75    3.53    13.23   27.60     130.09
Chile                 10.10   20.15   45.92   1.76    3.65    13.39   28.09     132.19
China                 10.17   20.42   45.25   1.77    3.61    13.42   28.17     129.18
Columbia              10.29   20.85   45.84   1.80    3.72    13.49   27.88     131.17
Cook Islands          10.97   22.46   51.40   1.94    4.24    16.70   35.38     171.26
Costa Rica            10.32   20.96   46.42   1.87    3.84    13.75   28.81     133.23
Czech Republic        10.24   20.61   45.77   1.75    3.58    13.42   27.80     131.57
Denmark               10.29   20.52   45.89   1.69    3.52    13.42   27.91     129.43
Dominican Republic    10.16   20.65   44.90   1.81    3.73    14.31   30.43     146.00
Finland               10.21   20.47   45.49   1.74    3.61    13.27   27.52     131.15
France                10.02   20.16   44.64   1.72    3.48    12.98   27.38     126.36
Germany               10.06   20.23   44.33   1.73    3.53    12.91   27.36     128.47
Great Britain          9.87   19.94   44.36   1.70    3.49    13.01   27.30     127.13
Greece                10.11   19.85   45.57   1.75    3.61    13.48   28.12     132.04
Guatemala             10.32   21.09   48.44   1.82    3.74    13.98   29.34     132.53
Hungary               10.08   20.11   45.43   1.76    3.59    13.45   28.03     132.10
India                 10.33   20.73   45.48   1.76    3.63    13.50   28.81     132.00
Indonesia             10.20   20.93   46.37   1.83    3.77    14.21   29.65     139.18
Ireland               10.35   20.54   45.58   1.75    3.56    13.07   27.78     129.15
Israel                10.20   20.89   46.59   1.80    3.70    13.66   28.72     134.21
Italy                 10.01   19.72   45.26   1.73    3.35    13.09   27.28     127.29
Japan                 10.00   20.03   44.78   1.77    3.62    13.22   27.58     126.16
Kenya                 10.28   20.43   44.18   1.70    3.44    12.66   26.46     124.55
Korea, South          10.34   20.41   45.37   1.74    3.64    13.84   28.51     127.20
Korea, North          10.60   21.23   46.95   1.82    3.77    13.90   28.45     129.26
Luxembourg            10.41   20.77   47.90   1.76    3.67    13.64   28.77     134.03
Malaysia              10.30   20.92   46.41   1.79    3.76    14.11   29.50     149.27
Mauritius             10.13   20.06   44.69   1.80    3.83    14.15   29.84     143.07
Mexico                10.21   20.40   44.31   1.78    3.63    13.13   27.14     127.19
Myanmar (Burma)       10.64   21.52   48.63   1.80    3.80    14.19   29.62     139.57
Netherlands           10.19   20.19   45.68   1.73    3.55    13.22   27.44     128.31
New Zealand           10.11   20.42   46.09   1.74    3.54    13.21   27.70     128.59
Norway                10.08   20.17   46.11   1.71    3.62    13.11   27.54     130.17
Papua New Guinea      10.40   21.18   46.77   1.80    4.00    14.72   31.36     148.13
Philippines           10.57   21.43   45.57   1.80    3.82    13.97   29.04     138.44
Poland                10.00   19.98   44.62   1.72    3.59    13.29   27.89     129.23
Portugal               9.86   20.12   46.11   1.75    3.50    13.05   27.21     126.36
Romania               10.21   20.75   45.77   1.76    3.57    13.25   27.67     132.30
Russia                10.11   20.23   44.60   1.71    3.54    13.20   27.90     129.16
Samoa                 10.78   21.86   49.98   1.94    4.01    16.28   34.71     161.50
Singapore             10.37   21.14   47.60   1.84    3.86    14.96   31.32     144.22
Spain                 10.17   20.59   44.96   1.73    3.48    13.04   27.24     127.23
Sweden                10.18   20.43   45.54   1.76    3.61    13.29   27.93     130.38
Switzerland           10.16   20.41   44.99   1.71    3.53    13.13   27.90     129.56
Taiwan                10.36   20.81   46.72   1.79    3.77    13.91   29.20     134.35
Thailand              10.23   20.69   46.05   1.81    3.77    14.25   29.67     139.33
Turkey                10.38   21.04   46.63   1.78    3.59    13.45   28.33     130.25
U.S.A.                 9.78   19.32   43.18   1.71    3.46    12.97   27.23     125.38

Source: IAAF/ATFS Track and Field Statistics Handbook for the Helsinki 2005 Olympics. Courtesy of Ottavio Castellini.




8.23. A naturalist for the Alaska Fish and Game Department studies grizzly bears with the goal of maintaining a healthy population. Measurements on $n = 61$ bears provided the following summary statistics:

Variable: Weight (kg), Body length (cm), Neck (cm), Girth (cm), Head length (cm), Head width (cm)

Sample 
mean x 


164.38 55.69 93.39 17.98 


Covariance matrix

S =
    3266.46   1343.97   731.54   1175.50   162.68   238.37
    1343.97    721.91   324.25    537.35    80.17   117.73
     731.54    324.25   179.28    281.17    39.15    56.80
    1175.50    537.35   281.17    474.98    63.73    94.85
     162.68     80.17    39.15     63.73     9.95    13.88
     238.37    117.73    56.80     94.85    13.88    21.26


(a) Perform a principal component analysis using the covariance matrix. Can the data 
be effectively summarized in fewer than six dimensions? 

(b) Perform a principal component analysis using the correlation matrix. 

(c) Comment on the similarities and differences between the two analyses. 


8.24. Refer to Example 8.10 and the data in Table 5.8, page 240. Add the variable x5 = regular 
overtime hours whose values are (read across) 


6187 7336 6988 6964 8425 6778 5922 7307 
7679 8259 10954 9353 6291 4969 4825 6019 


and redo Example 8.10. 


8.25. Refer to the police overtime hours data in Example 8.10. Construct an alternate control chart, based on the sum of squares $d_{Uj}^2$, to monitor the unexplained variation in the original observations summarized by the additional principal components.


8.26. Consider the psychological profile data in Table 4.6. Using the five variables, Indep, Supp, Benev, Conform and Leader, perform a principal component analysis using the covariance matrix $\mathbf{S}$ and the correlation matrix $\mathbf{R}$. Your analysis should include the following:

(a) Determine the appropriate number of components to effectively summarize the variability. Construct a scree plot to aid in your determination.

(b) Interpret the sample principal components.

(c) Using the values for the first two principal components, plot the data in a two-dimensional space with $\hat{y}_1$ along the vertical axis and $\hat{y}_2$ along the horizontal axis. Can you distinguish groups representing the two socioeconomic levels and/or the two genders? Are there any outliers?

(d) Construct a 95% confidence interval for $\lambda_1$, the variance of the first population principal component from the covariance matrix.


8.27. The pulp and paper properties data are given in Table 7.7. Using the four paper variables, BL (breaking length), EM (elastic modulus), SF (stress at failure) and BS (burst strength), perform a principal component analysis using the covariance matrix $\mathbf{S}$ and the correlation matrix $\mathbf{R}$. Your analysis should include the following:

(a) Determine the appropriate number of components to effectively summarize the variability. Construct a scree plot to aid in your determination.

(b) Interpret the sample principal components.

(c) Do you think it is possible to develop a "paper strength" index that effectively contains the information in the four paper variables? Explain.

(d) Using the values for the first two principal components, plot the data in a two-dimensional space with $\hat{y}_1$ along the vertical axis and $\hat{y}_2$ along the horizontal axis. Identify any outliers in this data set.


8.28. Survey data were collected as part of a study to assess options for enhancing food security through the sustainable use of natural resources in the Sikasso region of Mali (West Africa). A total of $n = 76$ farmers were surveyed and observations on the nine variables

$x_1$ = Family (total number of individuals in household)
$x_2$ = DistRd (distance in kilometers to nearest passable road)
$x_3$ = Cotton (hectares of cotton planted in year 2000)
$x_4$ = Maize (hectares of maize planted in year 2000)
$x_5$ = Sorg (hectares of sorghum planted in year 2000)
$x_6$ = Millet (hectares of millet planted in year 2000)
$x_7$ = Bull (total number of bullocks or draft animals)
$x_8$ = Cattle (total); $x_9$ = Goats (total)

were recorded. The data are listed in Table 8.7 and on the website www.prenhall.com/statistics.

(a) Construct two-dimensional scatterplots of Family versus DistRd, and DistRd versus Cattle. Remove any obvious outliers from the data set.


Table 8.7 Mali Family Farm Data 
Family DistRD Cotton Maize Sorg Millet Bull Cattle Goats 


12 80 1.5 1.00 3.0 25 2 0 1 
54 8 6.0 4.00 0 1.00 6 32 5 
11 13 Ee) 1.00 0 0 0 0 0 
21 13 2.0 2.50 1.0 0 1 0 5 
61 30 3.0 5.00 0 0 4 21 0 
20 70 0 2.00 3.0 0 2 0 3 
29 35 1.5 2.00 0 0 0 0 0 
29 35 2.0 3.00 2.0 0 0 0 0 
57 9 5.0 5.00 0 0 4 5 2 
23 33 2.0 2.00 1.0 0 2 1 7 
20 0 1.5 1.00 3.0 0 1 6 0 
27 41 1.1 25 1.5 1.50 0 3 1 
18 500 2.0 1.00 1.5 .50 1 0 0 
30 19 2.0 2.00 4.0 1.00 2 0 5 
77 18 8.0 4.00 6.0 4.00 6 8 6 
21 500 5.0 1.00 3.0 4.00 1 0 5 
13 100 re) 50 0 1.00 0 0 4 
24 100 2.0 3.00 0 .50 3 14 10 
29 90 2.0 1.50 1.5 1.50 2 0 2 
57 90 10.0 7.00 0 1.50 a ha 8 7 


Source: Data courtesy of Jay Angerer. 


(b) Perform a principal component analysis using the correlation matrix $\mathbf{R}$. Determine the number of components to effectively summarize the variability. Use the proportion of variation explained and a scree plot to aid in your determination.

(c) Interpret the first five principal components. Can you identify, for example, a "farm size" component? A, perhaps, "goats and distance to road" component?


8.29. Refer to Exercise 5.28. Using the covariance matrix $\mathbf{S}$ for the first 30 cases of car body assembly data, obtain the sample principal components.

(a) Construct a 95% ellipse format chart using the first two principal components $\hat{y}_1$ and $\hat{y}_2$. Identify the car locations that appear to be out of control.

(b) Construct an alternative control chart, based on the sum of squares $d_{Uj}^2$, to monitor the variation in the original observations summarized by the remaining four principal components. Interpret this chart.


References

1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York: John Wiley, 2003.
2. Anderson, T. W. "Asymptotic Theory for Principal Components Analysis." Annals of Mathematical Statistics, 34 (1963), 122-148.
3. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
4. Dawkins, B. "Multivariate Analysis of National Track Records." The American Statistician, 43 (1989), 110-115.
5. Girschick, M. A. "On the Sampling Theory of Roots of Determinantal Equations." Annals of Mathematical Statistics, 10 (1939), 203-224.
6. Hotelling, H. "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology, 24 (1933), 417-441, 498-520.
7. Hotelling, H. "The Most Predictable Criterion." Journal of Educational Psychology, 26 (1935), 139-142.
8. Hotelling, H. "Simplified Calculation of Principal Components." Psychometrika, 1 (1936), 27-35.
9. Hotelling, H. "Relations between Two Sets of Variates." Biometrika, 28 (1936), 321-377.
10. Jolicoeur, P. "The Multivariate Generalization of the Allometry Equation." Biometrics, 19 (1963), 497-499.
11. Jolicoeur, P., and J. E. Mosimann. "Size and Shape Variation in the Painted Turtle: A Principal Component Analysis." Growth, 24 (1960), 339-354.
12. King, B. "Market and Industry Factors in Stock Price Behavior." Journal of Business, 39 (1966), 139-190.
13. Kourti, T., and J. McGregor. "Multivariate SPC Methods for Process and Product Monitoring." Journal of Quality Technology, 28 (1996), 409-428.
14. Lawley, D. N. "On Testing a Set of Correlation Coefficients for Equality." Annals of Mathematical Statistics, 34 (1963), 149-151.
15. Rao, C. R. Linear Statistical Inference and Its Applications (2nd ed.). New York: Wiley-Interscience, 2002.
16. Rencher, A. C. "Interpretation of Canonical Discriminant Functions, Canonical Variates and Principal Components." The American Statistician, 46 (1992), 217-225.


Chapter 9


FACTOR ANALYSIS AND INFERENCE FOR STRUCTURED COVARIANCE MATRICES


9.1 Introduction 


Factor analysis has provoked rather turbulent controversy throughout its history. Its 
modern beginnings lie in the early-20th-century attempts of Karl Pearson, Charles 
Spearman, and others to define and measure intelligence. Because of this early 
association with constructs such as intelligence, factor analysis was nurtured and 
developed primarily by scientists interested in psychometrics. Arguments over the 
psychological interpretations of several early studies and the lack of powerful com- 
puting facilities impeded its initial development as a statistical method. The advent 
of high-speed computers has generated a renewed interest in the theoretical and 
computational aspects of factor analysis. Most of the original techniques have been 
abandoned and early controversies resolved in the wake of recent developments. It 
is still true, however, that each application of the technique must be examined on its 
own merits to determine its success. 

The essential purpose of factor analysis is to describe, if possible, the covariance 
relationships among many variables in terms of a few underlying, but unobservable, 
random quantities called factors. Basically, the factor model is motivated by the 
following argument: Suppose variables can be grouped by their correlations. That is, 
suppose all variables within a particular group are highly correlated among them- 
selves, but have relatively small correlations with variables in a different group. Then 
it is conceivable that each group of variables represents a single underlying construct, 
or factor, that is responsible for the observed correlations. For example, correlations 
from the group of test scores in classics, French, English, mathematics, and music 
collected by Spearman suggested an underlying “intelligence” factor. A second group 
of variables, representing physical-fitness scores, if available, might correspond to 
another factor. It is this type of structure that factor analysis seeks to confirm. 




Factor analysis can be considered an extension of principal component analysis. Both can be viewed as attempts to approximate the covariance matrix $\boldsymbol{\Sigma}$. However, the approximation based on the factor analysis model is more elaborate. The primary question in factor analysis is whether the data are consistent with a prescribed structure.


9.2 The Orthogonal Factor Model


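The observable random vector $\mathbf{X}$, with $p$ components, has mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. The factor model postulates that $\mathbf{X}$ is linearly dependent upon a few unobservable random variables $F_1, F_2, \ldots, F_m$, called common factors, and $p$ additional sources of variation $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p$, called errors or, sometimes, specific factors.$^1$ In particular, the factor analysis model is

$$
\begin{aligned}
X_1 - \mu_1 &= \ell_{11}F_1 + \ell_{12}F_2 + \cdots + \ell_{1m}F_m + \varepsilon_1 \\
X_2 - \mu_2 &= \ell_{21}F_1 + \ell_{22}F_2 + \cdots + \ell_{2m}F_m + \varepsilon_2 \\
&\ \ \vdots \\
X_p - \mu_p &= \ell_{p1}F_1 + \ell_{p2}F_2 + \cdots + \ell_{pm}F_m + \varepsilon_p
\end{aligned}
\tag{9-1}
$$

or, in matrix notation,

$$
\underset{(p \times 1)}{\mathbf{X} - \boldsymbol{\mu}} = \underset{(p \times m)}{\mathbf{L}}\ \underset{(m \times 1)}{\mathbf{F}} + \underset{(p \times 1)}{\boldsymbol{\varepsilon}}
\tag{9-2}
$$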

The coefficient $\ell_{ij}$ is called the loading of the $i$th variable on the $j$th factor, so the matrix $\mathbf{L}$ is the matrix of factor loadings. Note that the $i$th specific factor $\varepsilon_i$ is associated only with the $i$th response $X_i$. The $p$ deviations $X_1 - \mu_1, X_2 - \mu_2, \ldots, X_p - \mu_p$ are expressed in terms of $p + m$ random variables $F_1, F_2, \ldots, F_m, \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p$ which are unobservable. This distinguishes the factor model of (9-2) from the multivariate regression model in (7-23), in which the independent variables [whose position is occupied by $\mathbf{F}$ in (9-2)] can be observed.

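With so many unobservable quantities, a direct verification of the factor model from observations on $X_1, X_2, \ldots, X_p$ is hopeless. However, with some additional assumptions about the random vectors $\mathbf{F}$ and $\boldsymbol{\varepsilon}$, the model in (9-2) implies certain covariance relationships, which can be checked.

We assume that

$$
\begin{aligned}
E(\mathbf{F}) &= \underset{(m \times 1)}{\mathbf{0}}, &
\operatorname{Cov}(\mathbf{F}) &= E[\mathbf{F}\mathbf{F}'] = \underset{(m \times m)}{\mathbf{I}} \\
E(\boldsymbol{\varepsilon}) &= \underset{(p \times 1)}{\mathbf{0}}, &
\operatorname{Cov}(\boldsymbol{\varepsilon}) &= E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'] = \underset{(p \times p)}{\boldsymbol{\Psi}} =
\begin{bmatrix}
\psi_1 & 0 & \cdots & 0 \\
0 & \psi_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \psi_p
\end{bmatrix}
\end{aligned}
\tag{9-3}
$$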

1 As Maxwell [12] points out, in many investigations the $\varepsilon_i$ tend to be combinations of measurement error and factors that are uniquely associated with the individual variables.


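and that $\mathbf{F}$ and $\boldsymbol{\varepsilon}$ are independent, so

$$
\operatorname{Cov}(\boldsymbol{\varepsilon}, \mathbf{F}) = E(\boldsymbol{\varepsilon}\mathbf{F}') = \underset{(p \times m)}{\mathbf{0}}
$$

These assumptions and the relation in (9-2) constitute the orthogonal factor model.$^2$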


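Orthogonal Factor Model with m Common Factors

$$
\underset{(p \times 1)}{\mathbf{X}} = \underset{(p \times 1)}{\boldsymbol{\mu}} + \underset{(p \times m)}{\mathbf{L}}\ \underset{(m \times 1)}{\mathbf{F}} + \underset{(p \times 1)}{\boldsymbol{\varepsilon}}
\tag{9-4}
$$

$\mu_i$ = mean of variable $i$
$\varepsilon_i$ = $i$th specific factor
$F_j$ = $j$th common factor
$\ell_{ij}$ = loading of the $i$th variable on the $j$th factor

The unobservable random vectors $\mathbf{F}$ and $\boldsymbol{\varepsilon}$ satisfy the following conditions:
$\mathbf{F}$ and $\boldsymbol{\varepsilon}$ are independent
$E(\mathbf{F}) = \mathbf{0}$, $\operatorname{Cov}(\mathbf{F}) = \mathbf{I}$
$E(\boldsymbol{\varepsilon}) = \mathbf{0}$, $\operatorname{Cov}(\boldsymbol{\varepsilon}) = \boldsymbol{\Psi}$, where $\boldsymbol{\Psi}$ is a diagonal matrix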


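The orthogonal factor model implies a covariance structure for $\mathbf{X}$. From the model in (9-4),

$$
\begin{aligned}
(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' &= (\mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon})(\mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon})' \\
&= (\mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon})((\mathbf{L}\mathbf{F})' + \boldsymbol{\varepsilon}') \\
&= \mathbf{L}\mathbf{F}(\mathbf{L}\mathbf{F})' + \boldsymbol{\varepsilon}(\mathbf{L}\mathbf{F})' + \mathbf{L}\mathbf{F}\boldsymbol{\varepsilon}' + \boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'
\end{aligned}
$$

so that

$$
\begin{aligned}
\boldsymbol{\Sigma} = \operatorname{Cov}(\mathbf{X}) &= E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' \\
&= \mathbf{L}E(\mathbf{F}\mathbf{F}')\mathbf{L}' + E(\boldsymbol{\varepsilon}\mathbf{F}')\mathbf{L}' + \mathbf{L}E(\mathbf{F}\boldsymbol{\varepsilon}') + E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}') \\
&= \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}
\end{aligned}
$$

according to (9-3). Also by independence, $\operatorname{Cov}(\boldsymbol{\varepsilon}, \mathbf{F}) = E(\boldsymbol{\varepsilon}\mathbf{F}') = \mathbf{0}$.

Also, by the model in (9-4), $(\mathbf{X} - \boldsymbol{\mu})\mathbf{F}' = (\mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon})\mathbf{F}' = \mathbf{L}\mathbf{F}\mathbf{F}' + \boldsymbol{\varepsilon}\mathbf{F}'$, so

$$
\operatorname{Cov}(\mathbf{X}, \mathbf{F}) = E(\mathbf{X} - \boldsymbol{\mu})\mathbf{F}' = \mathbf{L}E(\mathbf{F}\mathbf{F}') + E(\boldsymbol{\varepsilon}\mathbf{F}') = \mathbf{L}.
$$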
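
The two covariance relationships just derived can be illustrated by simulation. The following is a rough sketch, not part of the original development: the loadings `L`, specific variances `psi`, and mean `mu` are hypothetical values chosen for illustration, and normal distributions are assumed for $\mathbf{F}$ and $\boldsymbol{\varepsilon}$ only for convenience (the derivation requires just the moment and independence conditions). With a large number of draws, the sample covariance matrix of $\mathbf{X}$ should be close to $\mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$, and the sample cross-covariance of $\mathbf{X}$ and $\mathbf{F}$ should be close to $\mathbf{L}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model with p = 4 variables and m = 2 common factors.
mu = np.array([1.0, 0.0, -2.0, 3.0])
L = np.array([[0.9, 0.0],
              [0.8, 0.3],
              [0.1, 0.7],
              [0.0, 0.6]])
psi = np.array([0.19, 0.27, 0.50, 0.64])            # diagonal of Psi

N = 200_000                                          # number of simulated observations
F = rng.standard_normal((N, 2))                      # E(F) = 0, Cov(F) = I
eps = rng.standard_normal((N, 4)) * np.sqrt(psi)     # E(eps) = 0, Cov(eps) = Psi
X = mu + F @ L.T + eps                               # X = mu + L F + eps, one row per draw

Sigma_model = L @ L.T + np.diag(psi)                 # covariance implied by the model
Sigma_hat = np.cov(X, rowvar=False)                  # sample covariance of X
Xc = X - X.mean(axis=0)
Fc = F - F.mean(axis=0)
cov_XF_hat = Xc.T @ Fc / (N - 1)                     # sample Cov(X, F), a p x m matrix

print(np.abs(Sigma_hat - Sigma_model).max())         # small, shrinking as N grows
print(np.abs(cov_XF_hat - L).max())                  # small, shrinking as N grows
```

In practice the factors are unobservable, so `cov_XF_hat` could not be computed from real data; the simulation has access to `F` only because it generated it.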


2 Allowing the factors F to be correlated so that Cov (F) is not diagonal gives the oblique factor 
model. The oblique model presents some additional estimation difficulties and will not be discussed in this 
book. (See [20].) 




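Covariance Structure for the Orthogonal Factor Model

1. $\operatorname{Cov}(X) = LL' + \Psi$
   or

$$\begin{aligned}
\operatorname{Var}(X_i) &= \ell_{i1}^2 + \cdots + \ell_{im}^2 + \psi_i \\
\operatorname{Cov}(X_i, X_k) &= \ell_{i1}\ell_{k1} + \cdots + \ell_{im}\ell_{km}
\end{aligned} \tag{9-5}$$

2. $\operatorname{Cov}(X, F) = L$
   or
   $\operatorname{Cov}(X_i, F_j) = \ell_{ij}$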


The model $X - \mu = LF + \varepsilon$ is linear in the common factors. If the $p$ responses
$X$ are, in fact, related to underlying factors, but the relationship is nonlinear, such as
in $X_1 - \mu_1 = \ell_{11}F_1F_3 + \varepsilon_1$, $X_2 - \mu_2 = \ell_{21}F_2F_3 + \varepsilon_2$, and so forth, then the
covariance structure $LL' + \Psi$ given by (9-5) may not be adequate. The very important
assumption of linearity is inherent in the formulation of the traditional factor model.

That portion of the variance of the $i$th variable contributed by the $m$ common
factors is called the $i$th communality. That portion of $\operatorname{Var}(X_i) = \sigma_{ii}$ due to the
specific factor is often called the uniqueness, or specific variance. Denoting the $i$th
communality by $h_i^2$, we see from (9-5) that

$$\underbrace{\sigma_{ii}}_{\operatorname{Var}(X_i)} = \underbrace{\ell_{i1}^2 + \ell_{i2}^2 + \cdots + \ell_{im}^2}_{\text{communality}} + \underbrace{\psi_i}_{\text{specific variance}}$$

or

$$h_i^2 = \ell_{i1}^2 + \ell_{i2}^2 + \cdots + \ell_{im}^2 \tag{9-6}$$

and

$$\sigma_{ii} = h_i^2 + \psi_i, \qquad i = 1, 2, \ldots, p$$

The $i$th communality is the sum of squares of the loadings of the $i$th variable on the
$m$ common factors.


Example 9.1 (Verifying the relation $\Sigma = LL' + \Psi$ for two factors) Consider the
covariance matrix

$$\Sigma = \begin{bmatrix} 19 & 30 & 2 & 12 \\ 30 & 57 & 5 & 23 \\ 2 & 5 & 38 & 47 \\ 12 & 23 & 47 & 68 \end{bmatrix}$$




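The equality

$$\begin{bmatrix} 19 & 30 & 2 & 12 \\ 30 & 57 & 5 & 23 \\ 2 & 5 & 38 & 47 \\ 12 & 23 & 47 & 68 \end{bmatrix} = \begin{bmatrix} 4 & 1 \\ 7 & 2 \\ -1 & 6 \\ 1 & 8 \end{bmatrix} \begin{bmatrix} 4 & 7 & -1 & 1 \\ 1 & 2 & 6 & 8 \end{bmatrix} + \begin{bmatrix} 2 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix}$$

or

$$\Sigma = LL' + \Psi$$

may be verified by matrix algebra. Therefore, $\Sigma$ has the structure produced by an
$m = 2$ orthogonal factor model. Since

$$L = \begin{bmatrix} \ell_{11} & \ell_{12} \\ \ell_{21} & \ell_{22} \\ \ell_{31} & \ell_{32} \\ \ell_{41} & \ell_{42} \end{bmatrix} = \begin{bmatrix} 4 & 1 \\ 7 & 2 \\ -1 & 6 \\ 1 & 8 \end{bmatrix}, \qquad \Psi = \begin{bmatrix} \psi_1 & 0 & 0 & 0 \\ 0 & \psi_2 & 0 & 0 \\ 0 & 0 & \psi_3 & 0 \\ 0 & 0 & 0 & \psi_4 \end{bmatrix} = \begin{bmatrix} 2 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix}$$

the communality of $X_1$ is, from (9-6),

$$h_1^2 = \ell_{11}^2 + \ell_{12}^2 = 4^2 + 1^2 = 17$$

and the variance of $X_1$ can be decomposed as

$$\sigma_{11} = (\ell_{11}^2 + \ell_{12}^2) + \psi_1 = h_1^2 + \psi_1$$

or

$$\underbrace{19}_{\text{variance}} = \underbrace{4^2 + 1^2}_{\text{communality}} + \underbrace{2}_{\text{specific variance}} = 17 + 2$$

A similar breakdown occurs for the other variables. ■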
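
The matrix algebra of Example 9.1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names are illustrative only, and the numbers are those given in the example.

```python
import numpy as np

# Loadings and specific variances from Example 9.1
L = np.array([[4.0, 1.0],
              [7.0, 2.0],
              [-1.0, 6.0],
              [1.0, 8.0]])
Psi = np.diag([2.0, 4.0, 1.0, 3.0])

# Covariance matrix from Example 9.1
Sigma = np.array([[19.0, 30.0, 2.0, 12.0],
                  [30.0, 57.0, 5.0, 23.0],
                  [2.0, 5.0, 38.0, 47.0],
                  [12.0, 23.0, 47.0, 68.0]])

reproduced = L @ L.T + Psi               # LL' + Psi
communalities = (L ** 2).sum(axis=1)     # h_i^2, as in (9-6)

print(np.allclose(reproduced, Sigma))    # does LL' + Psi equal Sigma?
print(communalities)                     # communality of each variable
print(communalities + np.diag(Psi))      # should match the diagonal of Sigma
```

The last line illustrates the decomposition of each variance into communality plus specific variance.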


The factor model assumes that the $p + p(p - 1)/2 = p(p + 1)/2$ variances
and covariances for $X$ can be reproduced from the $pm$ factor loadings $\ell_{ij}$ and the $p$
specific variances $\psi_i$. When $m = p$, any covariance matrix $\Sigma$ can be reproduced
exactly as $LL'$ [see (9-11)], so $\Psi$ can be the zero matrix. However, it is when $m$ is small
relative to $p$ that factor analysis is most useful. In this case, the factor model
provides a "simple" explanation of the covariation in $X$ with fewer parameters than the
$p(p + 1)/2$ parameters in $\Sigma$. For example, if $X$ contains $p = 12$ variables, and the
factor model in (9-4) with $m = 2$ is appropriate, then the $p(p + 1)/2 = 12(13)/2 = 78$
elements of $\Sigma$ are described in terms of the $mp + p = 12(2) + 12 = 36$ parameters
$\ell_{ij}$ and $\psi_i$ of the factor model.




Unfortunately for the factor analyst, most covariance matrices cannot be
factored as $LL' + \Psi$, where the number of factors $m$ is much less than $p$. The following
example demonstrates one of the problems that can arise when attempting to
determine the parameters $\ell_{ij}$ and $\psi_i$ from the variances and covariances of the observable
variables.


Example 9.2 (Nonexistence of a proper solution) Let $p = 3$ and $m = 1$, and suppose
the random variables $X_1$, $X_2$, and $X_3$ have the positive definite covariance matrix

$$\Sigma = \begin{bmatrix} 1 & .9 & .7 \\ .9 & 1 & .4 \\ .7 & .4 & 1 \end{bmatrix}$$


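Using the factor model in (9-4), we obtain

$$\begin{aligned}
X_1 - \mu_1 &= \ell_{11}F_1 + \varepsilon_1 \\
X_2 - \mu_2 &= \ell_{21}F_1 + \varepsilon_2 \\
X_3 - \mu_3 &= \ell_{31}F_1 + \varepsilon_3
\end{aligned}$$

The covariance structure in (9-5) implies that

$$\Sigma = LL' + \Psi$$

or

$$\begin{aligned}
1 &= \ell_{11}^2 + \psi_1, & .90 &= \ell_{11}\ell_{21}, & .70 &= \ell_{11}\ell_{31} \\
1 &= \ell_{21}^2 + \psi_2, & .40 &= \ell_{21}\ell_{31} & & \\
1 &= \ell_{31}^2 + \psi_3 & & & &
\end{aligned}$$

The pair of equations

$$.70 = \ell_{11}\ell_{31}, \qquad .40 = \ell_{21}\ell_{31}$$

implies that

$$\ell_{21} = \left(\frac{.40}{.70}\right)\ell_{11}$$

Substituting this result for $\ell_{21}$ in the equation

$$.90 = \ell_{11}\ell_{21}$$

yields $\ell_{11}^2 = 1.575$, or $\ell_{11} = \pm 1.255$. Since $\operatorname{Var}(F_1) = 1$ (by assumption) and
$\operatorname{Var}(X_1) = 1$, $\ell_{11} = \operatorname{Cov}(X_1, F_1) = \operatorname{Corr}(X_1, F_1)$. Now, a correlation coefficient
cannot be greater than unity (in absolute value), so, from this point of view,
$|\ell_{11}| = 1.255$ is too large. Also, the equation

$$1 = \ell_{11}^2 + \psi_1, \quad\text{or}\quad \psi_1 = 1 - \ell_{11}^2$$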




gives

$$\psi_1 = 1 - 1.575 = -.575$$

which is unsatisfactory, since it gives a negative value for $\operatorname{Var}(\varepsilon_1) = \psi_1$.

Thus, for this example with $m = 1$, it is possible to get a unique numerical
solution to the equations $\Sigma = LL' + \Psi$. However, the solution is not consistent with
the statistical interpretation of the coefficients, so it is not a proper solution. ■
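
The arithmetic of Example 9.2 can be reproduced in a few lines. This is a small sketch, assuming NumPy; the names `s12`, `s13`, `s23`, and `improper` are illustrative and not part of the text.

```python
import numpy as np

# Off-diagonal covariances from Example 9.2 (p = 3, m = 1)
s12, s13, s23 = 0.90, 0.70, 0.40

# From s13 = l11*l31 and s23 = l21*l31 we get l21 = (s23/s13) * l11.
# Substituting into s12 = l11*l21 gives l11^2 = s12 * s13 / s23.
l11_sq = s12 * s13 / s23
l11 = np.sqrt(l11_sq)
psi1 = 1.0 - l11_sq              # from 1 = l11^2 + psi1

# A proper solution needs |l11| <= 1 (it is a correlation) and psi1 >= 0.
improper = (abs(l11) > 1.0) or (psi1 < 0.0)

print(round(l11_sq, 3), round(l11, 3), round(psi1, 3))
print(improper)
```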


When $m > 1$, there is always some inherent ambiguity associated with the factor
model. To see this, let $T$ be any $m \times m$ orthogonal matrix, so that $TT' = T'T = I$.
Then the expression in (9-2) can be written

$$X - \mu = LF + \varepsilon = LTT'F + \varepsilon = L^*F^* + \varepsilon \tag{9-7}$$

where

$$L^* = LT \quad\text{and}\quad F^* = T'F$$

Since

$$E(F^*) = T'E(F) = 0$$

and

$$\operatorname{Cov}(F^*) = T'\operatorname{Cov}(F)\,T = T'T = \underset{(m\times m)}{I}$$

it is impossible, on the basis of observations on $X$, to distinguish the loadings $L$ from
the loadings $L^*$. That is, the factors $F$ and $F^* = T'F$ have the same statistical
properties, and even though the loadings $L^*$ are, in general, different from the loadings
$L$, they both generate the same covariance matrix $\Sigma$. That is,

$$\Sigma = LL' + \Psi = LTT'L' + \Psi = (L^*)(L^*)' + \Psi \tag{9-8}$$

This ambiguity provides the rationale for "factor rotation," since orthogonal matrices
correspond to rotations (and reflections) of the coordinate system for $X$.


Factor loadings $L$ are determined only up to an orthogonal matrix $T$. Thus, the
loadings

$$L^* = LT \quad\text{and}\quad L \tag{9-9}$$

both give the same representation. The communalities, given by the diagonal
elements of $LL' = (L^*)(L^*)'$, are also unaffected by the choice of $T$.
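
The invariance in (9-8) and (9-9) is easy to see numerically. The sketch below, assuming NumPy, rotates the loadings of Example 9.1 by an arbitrary angle; the angle `theta` is an assumption chosen only for illustration.

```python
import numpy as np

# Loadings from Example 9.1 (m = 2)
L = np.array([[4.0, 1.0], [7.0, 2.0], [-1.0, 6.0], [1.0, 8.0]])

theta = np.pi / 6    # arbitrary rotation angle (illustrative assumption)
T = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])   # orthogonal: TT' = T'T = I

L_star = L @ T       # rotated loadings L* = LT

same_cov = np.allclose(L @ L.T, L_star @ L_star.T)                       # (9-8)
same_comm = np.allclose((L ** 2).sum(axis=1), (L_star ** 2).sum(axis=1))  # communalities

print(same_cov, same_comm)
print(np.round(L_star, 3))   # the loadings themselves do change
```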


The analysis of the factor model proceeds by imposing conditions that allow
one to uniquely estimate $L$ and $\Psi$. The loading matrix is then rotated (multiplied
by an orthogonal matrix), where the rotation is determined by some
"ease-of-interpretation" criterion. Once the loadings and specific variances are obtained,
factors are identified, and estimated values for the factors themselves (called factor
scores) are frequently constructed.




9.3 Methods of Estimation 


Given observations $x_1, x_2, \ldots, x_n$ on $p$ generally correlated variables, factor analysis
seeks to answer the question, Does the factor model of (9-4), with a small number of
factors, adequately represent the data? In essence, we tackle this statistical
model-building problem by trying to verify the covariance relationship in (9-5).

The sample covariance matrix $S$ is an estimator of the unknown population
covariance matrix $\Sigma$. If the off-diagonal elements of $S$ are small or those of the sample
correlation matrix $R$ essentially zero, the variables are not related, and a factor
analysis will not prove useful. In these circumstances, the specific factors play the
dominant role, whereas the major aim of factor analysis is to determine a few
important common factors.

If $\Sigma$ appears to deviate significantly from a diagonal matrix, then a factor model
can be entertained, and the initial problem is one of estimating the factor loadings $\ell_{ij}$
and specific variances $\psi_i$. We shall consider two of the most popular methods of
parameter estimation, the principal component (and the related principal factor) method
and the maximum likelihood method. The solution from either method can be rotated
in order to simplify the interpretation of factors, as described in Section 9.4. It is
always prudent to try more than one method of solution; if the factor model is
appropriate for the problem at hand, the solutions should be consistent with one another.

Current estimation and rotation methods require iterative calculations that must
be done on a computer. Several computer programs are now available for this purpose.


The Principal Component (and Principal Factor) Method 


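The spectral decomposition of (2-16) provides us with one factoring of the covariance
matrix $\Sigma$. Let $\Sigma$ have eigenvalue-eigenvector pairs $(\lambda_i, e_i)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.
Then

$$\Sigma = \lambda_1 e_1 e_1' + \lambda_2 e_2 e_2' + \cdots + \lambda_p e_p e_p' = \left[\sqrt{\lambda_1}\,e_1 \;\middle|\; \sqrt{\lambda_2}\,e_2 \;\middle|\; \cdots \;\middle|\; \sqrt{\lambda_p}\,e_p\right] \begin{bmatrix} \sqrt{\lambda_1}\,e_1' \\ \sqrt{\lambda_2}\,e_2' \\ \vdots \\ \sqrt{\lambda_p}\,e_p' \end{bmatrix} \tag{9-10}$$

This fits the prescribed covariance structure for the factor analysis model having as
many factors as variables ($m = p$) and specific variances $\psi_i = 0$ for all $i$. The
loading matrix has $j$th column given by $\sqrt{\lambda_j}\,e_j$. That is, we can write

$$\underset{(p\times p)}{\Sigma} = \underset{(p\times p)}{L}\;\underset{(p\times p)}{L'} + \underset{(p\times p)}{0} = LL' \tag{9-11}$$

Apart from the scale factor $\sqrt{\lambda_j}$, the factor loadings on the $j$th factor are the
coefficients for the $j$th principal component of the population.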
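
As an illustration of (9-10) and (9-11), the sketch below factors the covariance matrix of Example 9.1 exactly with $m = p$ factors. It assumes NumPy; `L_full` is an illustrative name.

```python
import numpy as np

# Covariance matrix from Example 9.1
Sigma = np.array([[19.0, 30.0, 2.0, 12.0],
                  [30.0, 57.0, 5.0, 23.0],
                  [2.0, 5.0, 38.0, 47.0],
                  [12.0, 23.0, 47.0, 68.0]])

# eigh returns eigenvalues in ascending order; reorder to lambda_1 >= ... >= lambda_p
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# jth column of the loading matrix is sqrt(lambda_j) * e_j
L_full = eigvecs * np.sqrt(eigvals)

exact = np.allclose(L_full @ L_full.T, Sigma)   # Sigma = LL' with Psi = 0
print(exact)
```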

Although the factor analysis representation of $\Sigma$ in (9-11) is exact, it is not
particularly useful: It employs as many common factors as there are variables and does
not allow for any variation in the specific factors $\varepsilon$ in (9-4). We prefer models that
explain the covariance structure in terms of just a few common factors. One




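approach, when the last $p - m$ eigenvalues are small, is to neglect the contribution
of $\lambda_{m+1}e_{m+1}e_{m+1}' + \cdots + \lambda_p e_p e_p'$ to $\Sigma$ in (9-10). Neglecting this contribution, we
obtain the approximation

$$\Sigma \doteq \left[\sqrt{\lambda_1}\,e_1 \;\middle|\; \sqrt{\lambda_2}\,e_2 \;\middle|\; \cdots \;\middle|\; \sqrt{\lambda_m}\,e_m\right] \begin{bmatrix} \sqrt{\lambda_1}\,e_1' \\ \sqrt{\lambda_2}\,e_2' \\ \vdots \\ \sqrt{\lambda_m}\,e_m' \end{bmatrix} = \underset{(p\times m)}{L}\;\underset{(m\times p)}{L'} \tag{9-12}$$

The approximate representation in (9-12) assumes that the specific factors $\varepsilon$ in (9-4)
are of minor importance and can also be ignored in the factoring of $\Sigma$. If specific
factors are included in the model, their variances may be taken to be the diagonal
elements of $\Sigma - LL'$, where $LL'$ is as defined in (9-12).

Allowing for specific factors, we find that the approximation becomes

$$\Sigma \doteq LL' + \Psi = \left[\sqrt{\lambda_1}\,e_1 \;\middle|\; \sqrt{\lambda_2}\,e_2 \;\middle|\; \cdots \;\middle|\; \sqrt{\lambda_m}\,e_m\right] \begin{bmatrix} \sqrt{\lambda_1}\,e_1' \\ \sqrt{\lambda_2}\,e_2' \\ \vdots \\ \sqrt{\lambda_m}\,e_m' \end{bmatrix} + \begin{bmatrix} \psi_1 & 0 & \cdots & 0 \\ 0 & \psi_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \psi_p \end{bmatrix} \tag{9-13}$$

where $\psi_i = \sigma_{ii} - \sum_{j=1}^{m} \ell_{ij}^2$ for $i = 1, 2, \ldots, p$.
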
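To apply this approach to a data set $x_1, x_2, \ldots, x_n$, it is customary first to center
the observations by subtracting the sample mean $\bar{x}$. The centered observations

$$x_j - \bar{x} = \begin{bmatrix} x_{j1} \\ x_{j2} \\ \vdots \\ x_{jp} \end{bmatrix} - \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix} = \begin{bmatrix} x_{j1} - \bar{x}_1 \\ x_{j2} - \bar{x}_2 \\ \vdots \\ x_{jp} - \bar{x}_p \end{bmatrix}, \qquad j = 1, 2, \ldots, n \tag{9-14}$$

have the same sample covariance matrix $S$ as the original observations.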
In cases in which the units of the variables are not commensurate, it is usually
desirable to work with the standardized variables

$$z_j = \begin{bmatrix} \dfrac{x_{j1} - \bar{x}_1}{\sqrt{s_{11}}} \\ \dfrac{x_{j2} - \bar{x}_2}{\sqrt{s_{22}}} \\ \vdots \\ \dfrac{x_{jp} - \bar{x}_p}{\sqrt{s_{pp}}} \end{bmatrix}, \qquad j = 1, 2, \ldots, n$$

whose sample covariance matrix is the sample correlation matrix $R$ of the
observations $x_1, x_2, \ldots, x_n$. Standardization avoids the problems of having one variable with
large variance unduly influencing the determination of factor loadings.
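
The two facts just stated (centering leaves $S$ unchanged, and the covariance matrix of the standardized variables is $R$) can be confirmed on any data set. The sketch below uses synthetic data generated purely for illustration; it assumes NumPy, and the data are not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic observations (n = 50, p = 3) on very different scales; illustrative only
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)                    # sample covariance, divisor n - 1
Z = (X - xbar) / np.sqrt(np.diag(S))           # standardized observations z_j

S_centered = np.cov(X - xbar, rowvar=False)    # covariance of centered data
R = np.corrcoef(X, rowvar=False)               # sample correlation matrix
S_Z = np.cov(Z, rowvar=False)                  # covariance of standardized data

print(np.allclose(S_centered, S), np.allclose(S_Z, R))
```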




The representation in (9-13), when applied to the sample covariance matrix $S$ or
the sample correlation matrix $R$, is known as the principal component solution. The
name follows from the fact that the factor loadings are the scaled coefficients of the
first few sample principal components. (See Chapter 8.)


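Principal Component Solution of the Factor Model

The principal component factor analysis of the sample covariance matrix $S$ is
specified in terms of its eigenvalue-eigenvector pairs $(\hat\lambda_1, \hat e_1), (\hat\lambda_2, \hat e_2), \ldots,
(\hat\lambda_p, \hat e_p)$, where $\hat\lambda_1 \ge \hat\lambda_2 \ge \cdots \ge \hat\lambda_p$. Let $m < p$ be the number of common
factors. Then the matrix of estimated factor loadings $\{\tilde\ell_{ij}\}$ is given by

$$\tilde L = \left[\sqrt{\hat\lambda_1}\,\hat e_1 \;\middle|\; \sqrt{\hat\lambda_2}\,\hat e_2 \;\middle|\; \cdots \;\middle|\; \sqrt{\hat\lambda_m}\,\hat e_m\right] \tag{9-15}$$

The estimated specific variances are provided by the diagonal elements of the
matrix $S - \tilde L\tilde L'$, so

$$\tilde\Psi = \begin{bmatrix} \tilde\psi_1 & 0 & \cdots & 0 \\ 0 & \tilde\psi_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \tilde\psi_p \end{bmatrix} \quad\text{with}\quad \tilde\psi_i = s_{ii} - \sum_{j=1}^{m} \tilde\ell_{ij}^2 \tag{9-16}$$

Communalities are estimated as

$$\tilde h_i^2 = \tilde\ell_{i1}^2 + \tilde\ell_{i2}^2 + \cdots + \tilde\ell_{im}^2 \tag{9-17}$$

The principal component factor analysis of the sample correlation matrix is
obtained by starting with $R$ in place of $S$.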
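
The steps in (9-15) through (9-17) can be sketched as a short function. This is a minimal illustration assuming NumPy; the function name `pc_factor` and its return layout are hypothetical, and the covariance matrix of Example 9.1 is used only as convenient input.

```python
import numpy as np

def pc_factor(S, m):
    """Principal component solution (9-15)-(9-17) for a covariance or
    correlation matrix S with m common factors (illustrative sketch)."""
    S = np.asarray(S, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(S)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    L = eigvecs[:, :m] * np.sqrt(eigvals[:m])     # (9-15)
    h2 = (L ** 2).sum(axis=1)                     # communalities, (9-17)
    psi = np.diag(S) - h2                         # specific variances, (9-16)
    return L, psi, h2, eigvals

S = np.array([[19.0, 30.0, 2.0, 12.0],
              [30.0, 57.0, 5.0, 23.0],
              [2.0, 5.0, 38.0, 47.0],
              [12.0, 23.0, 47.0, 68.0]])

L1, psi1, h2_1, lam = pc_factor(S, 1)
L2, psi2, h2_2, _ = pc_factor(S, 2)

# By construction the diagonal of S is reproduced exactly.
print(np.allclose(np.diag(L2 @ L2.T) + psi2, np.diag(S)))
# Loadings on the first factor are the same for m = 1 and m = 2.
print(np.allclose(L1[:, 0], L2[:, 0]))
```

Note that the sign of each eigenvector, and hence of each loading column, is arbitrary; software packages may differ in the convention they apply.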


For the principal component solution, the estimated loadings for a given
factor do not change as the number of factors is increased. For example, if $m = 1$,
$\tilde L = \left[\sqrt{\hat\lambda_1}\,\hat e_1\right]$, and if $m = 2$, $\tilde L = \left[\sqrt{\hat\lambda_1}\,\hat e_1 \;\middle|\; \sqrt{\hat\lambda_2}\,\hat e_2\right]$, where $(\hat\lambda_1, \hat e_1)$ and $(\hat\lambda_2, \hat e_2)$
are the first two eigenvalue-eigenvector pairs for $S$ (or $R$).

By the definition of $\tilde\psi_i$, the diagonal elements of $S$ are equal to the diagonal
elements of $\tilde L\tilde L' + \tilde\Psi$. However, the off-diagonal elements of $S$ are not usually
reproduced by $\tilde L\tilde L' + \tilde\Psi$. How, then, do we select the number of factors $m$?

If the number of common factors is not determined by a priori considerations, 
such as by theory or the work of other researchers, the choice of m can be based on 
the estimated eigenvalues in much the same manner as with principal components. 
Consider the residual matrix

$$S - (\tilde L\tilde L' + \tilde\Psi) \tag{9-18}$$

resulting from the approximation of $S$ by the principal component solution. The
diagonal elements are zero, and if the other elements are also small, we may subjectively
take the $m$ factor model to be appropriate. Analytically, we have (see Exercise 9.5)

$$\text{Sum of squared entries of } \left(S - (\tilde L\tilde L' + \tilde\Psi)\right) \le \hat\lambda_{m+1}^2 + \cdots + \hat\lambda_p^2 \tag{9-19}$$




Consequently, a small value for the sum of the squares of the neglected eigenvalues 
implies a small value for the sum of the squared errors of approximation. 
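
The bound in (9-19) can be checked numerically for each choice of $m$. The sketch below assumes NumPy and uses the covariance matrix of Example 9.1 as illustrative input; the names `sse`, `bound`, and `checks` are not from the text.

```python
import numpy as np

S = np.array([[19.0, 30.0, 2.0, 12.0],
              [30.0, 57.0, 5.0, 23.0],
              [2.0, 5.0, 38.0, 47.0],
              [12.0, 23.0, 47.0, 68.0]])

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

checks = []
for m in (1, 2, 3):
    L = eigvecs[:, :m] * np.sqrt(eigvals[:m])      # principal component loadings
    Psi = np.diag(np.diag(S - L @ L.T))            # specific variances on the diagonal
    resid = S - (L @ L.T + Psi)                    # residual matrix (9-18)
    sse = (resid ** 2).sum()                       # sum of squared entries
    bound = (eigvals[m:] ** 2).sum()               # neglected eigenvalues squared
    checks.append(bool(sse <= bound + 1e-9))
    print(m, round(sse, 4), round(bound, 4))
```

The inequality holds because the residual matrix is $S - \tilde L\tilde L'$ with its diagonal set to zero, and the sum of squared entries of $S - \tilde L\tilde L'$ equals the sum of the squared neglected eigenvalues.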

Ideally, the contributions of the first few factors to the sample variances of the
variables should be large. The contribution to the sample variance $s_{ii}$ from the
first common factor is $\tilde\ell_{i1}^2$. The contribution to the total sample variance,
$s_{11} + s_{22} + \cdots + s_{pp} = \operatorname{tr}(S)$, from the first common factor is then

$$\tilde\ell_{11}^2 + \tilde\ell_{21}^2 + \cdots + \tilde\ell_{p1}^2 = \left(\sqrt{\hat\lambda_1}\,\hat e_1\right)'\left(\sqrt{\hat\lambda_1}\,\hat e_1\right) = \hat\lambda_1$$

since the eigenvector $\hat e_1$ has unit length. In general,

$$\begin{pmatrix}\text{Proportion of total} \\ \text{sample variance} \\ \text{due to } j\text{th factor}\end{pmatrix} = \begin{cases} \dfrac{\hat\lambda_j}{s_{11} + s_{22} + \cdots + s_{pp}} & \text{for a factor analysis of } S \\[2ex] \dfrac{\hat\lambda_j}{p} & \text{for a factor analysis of } R \end{cases} \tag{9-20}$$

Criterion (9-20) is frequently used as a heuristic device for determining the
appropriate number of common factors. The number of common factors retained in the
model is increased until a "suitable proportion" of the total sample variance has
been explained.

Another convention, frequently encountered in packaged computer programs,
is to set $m$ equal to the number of eigenvalues of $R$ greater than one if the sample
correlation matrix is factored, or equal to the number of positive eigenvalues of $S$ if
the sample covariance matrix is factored. These rules of thumb should not be
applied indiscriminately. For example, $m = p$ if the rule for $S$ is obeyed, since all the
eigenvalues are expected to be positive for large sample sizes. The best approach is
to retain few rather than many factors, assuming that they provide a satisfactory
interpretation of the data and yield a satisfactory fit to $S$ or $R$.
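
Both heuristics, the proportion in (9-20) and the eigenvalue-greater-than-one rule, are simple to compute. The sketch below assumes NumPy and, purely for illustration, treats the $3 \times 3$ matrix of Example 9.2 as a correlation matrix; the variable names are hypothetical.

```python
import numpy as np

# The matrix of Example 9.2, used here as an illustrative correlation matrix
R = np.array([[1.0, 0.9, 0.7],
              [0.9, 1.0, 0.4],
              [0.7, 0.4, 1.0]])

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # descending eigenvalues
proportion = eigvals / np.trace(R)               # (9-20); trace(R) = p
cumulative = np.cumsum(proportion)

m_rule = int((eigvals > 1.0).sum())              # eigenvalues of R greater than one

print(np.round(eigvals, 3))
print(np.round(cumulative, 3))
print(m_rule)
```

As the text cautions, either rule is only a starting point; the retained factors should also be interpretable and fit $S$ or $R$ adequately.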


Example 9.3 (Factor analysis of consumer-preference data) In a consumer-preference
study, a random sample of customers were asked to rate several attributes of a new
product. The responses, on a 7-point semantic differential scale, were tabulated and
the attribute correlation matrix constructed. The correlation matrix is presented next:

Attribute (Variable)               1      2      3      4      5

Taste                       1   1.00    .02    .96    .42    .01
Good buy for money          2    .02   1.00    .13    .71    .85
Flavor                      3    .96    .13   1.00    .50    .11
Suitable for snack          4    .42    .71    .50   1.00    .79
Provides lots of energy     5    .01    .85    .11    .79   1.00

It is clear from the circled entries in the correlation matrix that variables 1 and 
3 and variables 2 and 5 form groups. Variable 4 is “closer” to the (2,5) group than 
the (1,3) group. Given these results and the small number of variables, we might ex- 
pect that the apparent linear relationships between the variables can be explained in 
terms of, at most, two or three common factors. 




The first two eigenvalues, $\hat\lambda_1 = 2.85$ and $\hat\lambda_2 = 1.81$, of $R$ are the only
eigenvalues greater than unity. Moreover, $m = 2$ common factors will account for a
cumulative proportion

$$\frac{\hat\lambda_1 + \hat\lambda_2}{p} = \frac{2.85 + 1.81}{5} = .93$$

of the total (standardized) sample variance. The estimated factor loadings,
communalities, and specific variances, obtained using (9-15), (9-16), and (9-17), are given in
Table 9.1.


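Table 9.1

                              Estimated factor loadings
                              $\tilde\ell_{ij} = \sqrt{\hat\lambda_j}\,\hat e_{ij}$     Communalities    Specific variances
Variable                      $F_1$       $F_2$        $\tilde h_i^2$            $\tilde\psi_i = 1 - \tilde h_i^2$

1. Taste                      .56        .82         .98              .02
2. Good buy for money         .78       -.53         .88              .12
3. Flavor                     .65        .75         .98              .02
4. Suitable for snack         .94       -.10         .89              .11
5. Provides lots of energy    .80       -.54         .93              .07

Eigenvalues                  2.85       1.81

Cumulative proportion
of total (standardized)
sample variance              .571       .932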
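
The principal component solution for Example 9.3 can be recomputed directly from the correlation matrix. The following is a sketch assuming NumPy; the sign normalization of the loading columns is an arbitrary convention chosen here for readability, not something prescribed by the text.

```python
import numpy as np

# Correlation matrix of Example 9.3
# (taste, good buy for money, flavor, suitable for snack, provides lots of energy)
R = np.array([
    [1.00, 0.02, 0.96, 0.42, 0.01],
    [0.02, 1.00, 0.13, 0.71, 0.85],
    [0.96, 0.13, 1.00, 0.50, 0.11],
    [0.42, 0.71, 0.50, 1.00, 0.79],
    [0.01, 0.85, 0.11, 0.79, 1.00],
])
m = 2

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

L = eigvecs[:, :m] * np.sqrt(eigvals[:m])   # loadings, (9-15)
L = L * np.sign(L.sum(axis=0))              # sign convention only (assumption)
h2 = (L ** 2).sum(axis=1)                   # communalities, (9-17)
psi = 1.0 - h2                              # specific variances, (9-16) with s_ii = 1
cum_prop = eigvals[:m].sum() / R.shape[0]   # cumulative proportion, (9-20)
residual = R - (L @ L.T + np.diag(psi))     # residual matrix, (9-18)

print(np.round(eigvals[:m], 2))
print(np.round(L, 2))
print(np.round(h2, 2), np.round(cum_prop, 3))
```

Small differences from the tabled values in the second decimal place are possible, since the published correlations are themselves rounded.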


Now,

$$\tilde L\tilde L' + \tilde\Psi = \begin{bmatrix} .56 & .82 \\ .78 & -.53 \\ .65 & .75 \\ .94 & -.10 \\ .80 & -.54 \end{bmatrix} \begin{bmatrix} .56 & .78 & .65 & .94 & .80 \\ .82 & -.53 & .75 & -.10 & -.54 \end{bmatrix} + \begin{bmatrix} .02 & 0 & 0 & 0 & 0 \\ 0 & .12 & 0 & 0 & 0 \\ 0 & 0 & .02 & 0 & 0 \\ 0 & 0 & 0 & .11 & 0 \\ 0 & 0 & 0 & 0 & .07 \end{bmatrix} = \begin{bmatrix} 1.00 & .01 & .97 & .44 & .00 \\ & 1.00 & .11 & .79 & .91 \\ & & 1.00 & .53 & .11 \\ & & & 1.00 & .81 \\ & & & & 1.00 \end{bmatrix}$$

nearly reproduces the correlation matrix $R$. Thus, on a purely descriptive basis, we
would judge a two-factor model with the factor loadings displayed in Table 9.1 as
providing a good fit to the data. The communalities (.98, .88, .98, .89, .93) indicate that the
two factors account for a large percentage of the sample variance of each variable.
We shall not interpret the factors at this point. As we noted in Section 9.2, the
factors (and loadings) are unique up to an orthogonal rotation. A rotation of the
factors often reveals a simple structure and aids interpretation. We shall consider
this example again (see Example 9.9 and Panel 9.1) after factor rotation has been
discussed. ■


Example 9.4 (Factor analysis of stock-price data) Stock-price data consisting of
$n = 103$ weekly rates of return on $p = 5$ stocks were introduced in Example 8.5.
In that example, the first two sample principal components were obtained from $R$.
Taking $m = 1$ and $m = 2$, we can easily obtain principal component solutions to
the orthogonal factor model. Specifically, the estimated factor loadings are the
sample principal component coefficients (eigenvectors of $R$), scaled by the
square root of the corresponding eigenvalues. The estimated factor loadings,
communalities, specific variances, and proportion of total (standardized) sample
variance explained by each factor for the $m = 1$ and $m = 2$ factor solutions are
available in Table 9.2. The communalities are given by (9-17). So, for example, with
$m = 2$, $\tilde h_1^2 = \tilde\ell_{11}^2 + \tilde\ell_{12}^2 = (.732)^2 + (-.437)^2 = .73$.


Table 9.2

                           One-factor solution                  Two-factor solution
                           Estimated factor   Specific          Estimated factor      Specific
                           loadings           variances         loadings              variances
Variable                   $F_1$                $\tilde\psi_i = 1 - \tilde h_i^2$     $F_1$       $F_2$        $\tilde\psi_i = 1 - \tilde h_i^2$

1. J P Morgan              .732               .46               .732      -.437       .27
2. Citibank                .831               .31               .831      -.280       .23
3. Wells Fargo             .726               .47               .726      -.374       .33
4. Royal Dutch Shell       .605               .63               .605       .694       .15
5. ExxonMobil              .563               .68               .563       .719       .17

Cumulative proportion
of total (standardized)
sample variance explained  .487                                 .487       .769


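The residual matrix corresponding to the solution for $m = 2$ factors is

$$R - \tilde L\tilde L' - \tilde\Psi = \begin{bmatrix} 0 & -.099 & -.185 & -.025 & .056 \\ -.099 & 0 & -.134 & .014 & -.054 \\ -.185 & -.134 & 0 & .003 & .006 \\ -.025 & .014 & .003 & 0 & -.156 \\ .056 & -.054 & .006 & -.156 & 0 \end{bmatrix}$$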




The proportion of the total variance explained by the two-factor solution is appreciably
larger than that for the one-factor solution. However, for $m = 2$, $\tilde L\tilde L'$ produces
numbers that are, in general, larger than the sample correlations. This is particularly
true for $r_{13}$.

It seems fairly clear that the first factor, $F_1$, represents general economic
conditions and might be called a market factor. All of the stocks load highly on this
factor, and the loadings are about equal. The second factor contrasts the banking
stocks with the oil stocks. (The banks have relatively large negative loadings, and
the oils have large positive loadings, on the factor.) Thus, $F_2$ seems to differentiate
stocks in different industries and might be called an industry factor. To summarize,
rates of return appear to be determined by general market conditions and activities
that are unique to the different industries, as well as a residual or firm specific
factor. This is essentially the conclusion reached by an examination of the sample
principal components in Example 8.5. ■


A Modified Approach—the Principal Factor Solution 


A modification of the principal component approach is sometimes considered. We
describe the reasoning in terms of a factor analysis of $R$, although the procedure is
also appropriate for $S$. If the factor model $\rho = LL' + \Psi$ is correctly specified, the
$m$ common factors should account for the off-diagonal elements of $\rho$, as well as
the communality portions of the diagonal elements

$$\rho_{ii} = 1 = h_i^2 + \psi_i$$

If the specific factor contribution $\psi_i$ is removed from the diagonal or, equivalently,
the 1 replaced by $h_i^2$, the resulting matrix is $\rho - \Psi = LL'$.

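Suppose, now, that initial estimates $\psi_i^*$ of the specific variances are available.
Then replacing the $i$th diagonal element of $R$ by $h_i^{*2} = 1 - \psi_i^*$, we obtain a "reduced"
sample correlation matrix

$$R_r = \begin{bmatrix} h_1^{*2} & r_{12} & \cdots & r_{1p} \\ r_{12} & h_2^{*2} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1p} & r_{2p} & \cdots & h_p^{*2} \end{bmatrix}$$

Now, apart from sampling variation, all of the elements of the reduced sample
correlation matrix $R_r$ should be accounted for by the $m$ common factors. In particular,
$R_r$ is factored as

$$R_r \doteq L_r^* L_r^{*\prime} \tag{9-21}$$

where $L_r^* = \{\ell_{ij}^*\}$ are the estimated loadings.
The principal factor method of factor analysis employs the estimates

$$\begin{aligned}
L_r^* &= \left[\sqrt{\hat\lambda_1^*}\,\hat e_1^* \;\middle|\; \sqrt{\hat\lambda_2^*}\,\hat e_2^* \;\middle|\; \cdots \;\middle|\; \sqrt{\hat\lambda_m^*}\,\hat e_m^*\right] \\
\psi_i^* &= 1 - \sum_{j=1}^{m} \ell_{ij}^{*2}
\end{aligned} \tag{9-22}$$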



where $(\hat\lambda_i^*, \hat e_i^*)$, $i = 1, 2, \ldots, m$ are the (largest) eigenvalue-eigenvector pairs
determined from $R_r$. In turn, the communalities would then be (re)estimated by

$$\tilde h_i^{*2} = \sum_{j=1}^{m} \ell_{ij}^{*2} \tag{9-23}$$

The principal factor solution can be obtained iteratively, with the communality
estimates of (9-23) becoming the initial estimates for the next stage.
In the spirit of the principal component solution, consideration of the estimated
eigenvalues $\hat\lambda_1^*, \hat\lambda_2^*, \ldots, \hat\lambda_p^*$ helps determine the number of common factors to retain.
An added complication is that now some of the eigenvalues may be negative, due to
the use of initial communality estimates. Ideally, we should take the number of
common factors equal to the rank of the reduced population matrix. Unfortunately, this
rank is not always well determined from $R_r$, and some judgment is necessary.
Although there are many choices for initial estimates of specific variances, the
most popular choice, when one is working with a correlation matrix, is $\psi_i^* = 1/r^{ii}$,
where $r^{ii}$ is the $i$th diagonal element of $R^{-1}$. The initial communality estimates then
become

$$h_i^{*2} = 1 - \psi_i^* = 1 - \frac{1}{r^{ii}} \tag{9-24}$$

which is equal to the square of the multiple correlation coefficient between $X_i$ and
the other $p - 1$ variables. The relation to the multiple correlation coefficient means
that $h_i^{*2}$ can be calculated even when $R$ is not of full rank. For factoring $S$, the initial
specific variance estimates use $s^{ii}$, the diagonal elements of $S^{-1}$. Further discussion
of these and other initial estimates is contained in [6].
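
The iterated principal factor procedure of (9-21) through (9-24) can be sketched as follows. This is an illustrative implementation assuming NumPy and a nonsingular $R$; the function name `principal_factor`, the fixed iteration count, and the clipping of negative eigenvalues at zero are assumptions of this sketch, not prescriptions from the text. The correlation matrix of Example 9.3 serves as input.

```python
import numpy as np

def principal_factor(R, m, n_iter=50):
    """Iterated principal factor solution for a correlation matrix R
    (illustrative sketch; assumes R is nonsingular and n_iter >= 1)."""
    R = np.asarray(R, dtype=float)
    # Initial communalities (9-24): h_i*^2 = 1 - 1/r^{ii}
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, h2)                    # reduced correlation matrix R_r
        eigvals, eigvecs = np.linalg.eigh(Rr)
        order = np.argsort(eigvals)[::-1][:m]       # m largest eigenvalues
        lam = np.clip(eigvals[order], 0.0, None)    # guard: eigenvalues of R_r may be negative
        L = eigvecs[:, order] * np.sqrt(lam)        # (9-22)
        h2 = (L ** 2).sum(axis=1)                   # re-estimated communalities, (9-23)
    psi = 1.0 - h2                                  # specific variances, (9-22)
    return L, psi, h2

R = np.array([
    [1.00, 0.02, 0.96, 0.42, 0.01],
    [0.02, 1.00, 0.13, 0.71, 0.85],
    [0.96, 0.13, 1.00, 0.50, 0.11],
    [0.42, 0.71, 0.50, 1.00, 0.79],
    [0.01, 0.85, 0.11, 0.79, 1.00],
])
L_pf, psi_pf, h2_pf = principal_factor(R, m=2)
print(np.round(h2_pf, 3))
```

A production routine would normally stop when the communalities change by less than a tolerance, and would report cases in which a communality estimate exceeds one.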

Although the principal component method for R can be regarded as a principal 
factor method with initial communality estimates of unity, or specific variances 
equal to zero, the two are philosophically and geometrically different. (See [6].) In 
practice, however, the two frequently produce comparable factor loadings if the 
number of variables is large and the number of common factors is small. 

We do not pursue the principal factor solution, since, to our minds, the solution 
methods that have the most to recommend them are the principal component 
method and the maximum likelihood method, which we discuss next. 


The Maximum Likelihood Method 
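If the common factors $F$ and the specific factors $\varepsilon$ can be assumed to be normally
distributed, then maximum likelihood estimates of the factor loadings and specific
variances may be obtained. When $F_j$ and $\varepsilon_j$ are jointly normal, the observations
$X_j - \mu = LF_j + \varepsilon_j$ are then normal, and from (4-16), the likelihood is

$$\begin{aligned}
L(\mu, \Sigma) &= (2\pi)^{-\frac{np}{2}}\,|\Sigma|^{-\frac{n}{2}} \exp\left\{-\tfrac{1}{2}\operatorname{tr}\left[\Sigma^{-1}\left(\sum_{j=1}^{n}(x_j - \bar x)(x_j - \bar x)' + n(\bar x - \mu)(\bar x - \mu)'\right)\right]\right\} \\
&= (2\pi)^{-\frac{(n-1)p}{2}}\,|\Sigma|^{-\frac{(n-1)}{2}} \exp\left\{-\tfrac{1}{2}\operatorname{tr}\left[\Sigma^{-1}\left(\sum_{j=1}^{n}(x_j - \bar x)(x_j - \bar x)'\right)\right]\right\} \\
&\quad\times (2\pi)^{-\frac{p}{2}}\,|\Sigma|^{-\frac{1}{2}} \exp\left\{-\tfrac{n}{2}(\bar x - \mu)'\Sigma^{-1}(\bar x - \mu)\right\}
\end{aligned} \tag{9-25}$$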



which depends on $L$ and $\Psi$ through $\Sigma = LL' + \Psi$. This model is still not well
defined, because of the multiplicity of choices for $L$ made possible by orthogonal
transformations. It is desirable to make $L$ well defined by imposing the
computationally convenient uniqueness condition

$$L'\Psi^{-1}L = \Delta \qquad \text{a diagonal matrix} \tag{9-26}$$

The maximum likelihood estimates $\hat L$ and $\hat\Psi$ must be obtained by numerical
maximization of (9-25). Fortunately, efficient computer programs now exist that
enable one to get these estimates rather easily.

We summarize some facts about maximum likelihood estimators and, for now,
rely on a computer to perform the numerical details.
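
The text does not say which numerical algorithm the software uses. As one possible illustration, the sketch below applies EM iterations (a standard technique for the normal factor model, substituted here as an assumption) to increase the likelihood (9-25) with $\hat\mu = \bar x$, and then rotates the loadings so that the uniqueness condition (9-26) holds. It assumes NumPy; the function name `fa_em`, the starting values, and the iteration count are all hypothetical choices. The input is the covariance matrix of Example 9.1.

```python
import numpy as np

def fa_em(S, m, n_iter=500):
    """EM iterations for Sigma = LL' + Psi given a covariance matrix S
    (illustrative sketch, not the routine used by any particular package)."""
    S = np.asarray(S, dtype=float)
    # Start from the principal component solution
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:m]
    L = eigvecs[:, order] * np.sqrt(eigvals[order])
    psi = np.maximum(np.diag(S) - (L ** 2).sum(axis=1), 1e-6)
    loglik = []
    for _ in range(n_iter):
        Sigma = L @ L.T + np.diag(psi)
        # Log-likelihood per observation, up to constants, maximized over mu
        _, logdet = np.linalg.slogdet(Sigma)
        loglik.append(-0.5 * (logdet + np.trace(np.linalg.solve(Sigma, S))))
        beta = np.linalg.solve(Sigma, L).T            # L' Sigma^{-1}, shape (m, p)
        Cff = np.eye(m) - beta @ L + beta @ S @ beta.T
        L = S @ beta.T @ np.linalg.inv(Cff)           # M-step for loadings
        psi = np.diag(S - L @ beta @ S).copy()        # M-step for specific variances
    # Rotate so that L' Psi^{-1} L is diagonal, as in (9-26)
    _, Q = np.linalg.eigh(L.T @ (L / psi[:, None]))
    return L @ Q, psi, np.array(loglik)

S = np.array([[19.0, 30.0, 2.0, 12.0],
              [30.0, 57.0, 5.0, 23.0],
              [2.0, 5.0, 38.0, 47.0],
              [12.0, 23.0, 47.0, 68.0]])
L_hat, psi_hat, ll = fa_em(S, m=2)
Delta = L_hat.T @ (L_hat / psi_hat[:, None])          # should be diagonal
print(np.round(Delta, 6))
```

EM can converge slowly, and dedicated routines typically use faster optimizers; the sketch is meant only to make the structure of the maximization concrete.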


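Result 9.1. Let $X_1, X_2, \ldots, X_n$ be a random sample from $N_p(\mu, \Sigma)$, where
$\Sigma = LL' + \Psi$ is the covariance matrix for the $m$ common factor model of (9-4).
The maximum likelihood estimators $\hat L$, $\hat\Psi$, and $\hat\mu = \bar x$ maximize (9-25) subject to
$\hat L'\hat\Psi^{-1}\hat L$ being diagonal.

The maximum likelihood estimates of the communalities are

$$\hat h_i^2 = \hat\ell_{i1}^2 + \hat\ell_{i2}^2 + \cdots + \hat\ell_{im}^2 \qquad \text{for } i = 1, 2, \ldots, p \tag{9-27}$$

so

$$\begin{pmatrix}\text{Proportion of total sample} \\ \text{variance due to } j\text{th factor}\end{pmatrix} = \frac{\hat\ell_{1j}^2 + \hat\ell_{2j}^2 + \cdots + \hat\ell_{pj}^2}{s_{11} + s_{22} + \cdots + s_{pp}} \tag{9-28}$$

Proof. By the invariance property of maximum likelihood estimates (see Section 4.3),
functions of $L$ and $\Psi$ are estimated by the same functions of $\hat L$ and $\hat\Psi$. In
particular, the communalities $h_i^2 = \ell_{i1}^2 + \cdots + \ell_{im}^2$ have maximum likelihood estimates
$\hat h_i^2 = \hat\ell_{i1}^2 + \cdots + \hat\ell_{im}^2$. ■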

If, as in (8-10), the variables are standardized so that $Z = V^{-1/2}(X - \mu)$, then
the covariance matrix $\rho$ of $Z$ has the representation

$$\rho = V^{-1/2}\Sigma V^{-1/2} = (V^{-1/2}L)(V^{-1/2}L)' + V^{-1/2}\Psi V^{-1/2} \tag{9-29}$$

Thus, $\rho$ has a factorization analogous to (9-5) with loading matrix $L_z = V^{-1/2}L$ and
specific variance matrix $\Psi_z = V^{-1/2}\Psi V^{-1/2}$. By the invariance property of
maximum likelihood estimators, the maximum likelihood estimator of $\rho$ is

$$\begin{aligned}
\hat\rho &= (\hat V^{-1/2}\hat L)(\hat V^{-1/2}\hat L)' + \hat V^{-1/2}\hat\Psi\hat V^{-1/2} \\
&= \hat L_z\hat L_z' + \hat\Psi_z
\end{aligned} \tag{9-30}$$

where $\hat V^{-1/2}$ and $\hat L$ are the maximum likelihood estimators of $V^{-1/2}$ and $L$,
respectively. (See Supplement 9A.)

As a consequence of the factorization of (9-30), whenever the maximum
likelihood analysis pertains to the correlation matrix, we call

$$\hat h_i^2 = \hat\ell_{i1}^2 + \hat\ell_{i2}^2 + \cdots + \hat\ell_{im}^2, \qquad i = 1, 2, \ldots, p \tag{9-31}$$




the maximum likelihood estimates of the communalities, and we evaluate the
importance of the factors on the basis of

$$\begin{pmatrix}\text{Proportion of total (standardized)} \\ \text{sample variance due to } j\text{th factor}\end{pmatrix} = \frac{\hat\ell_{1j}^2 + \hat\ell_{2j}^2 + \cdots + \hat\ell_{pj}^2}{p} \tag{9-32}$$

To avoid more tedious notations, the preceding $\hat\ell_{ij}$'s denote the elements of $\hat L_z$.


Comment. Ordinarily, the observations are standardized, and a sample
correlation matrix is factor analyzed. The sample correlation matrix $R$ is inserted for
$[(n - 1)/n]S$ in the likelihood function of (9-25), and the maximum likelihood
estimates $\hat L_z$ and $\hat\Psi_z$ are obtained using a computer. Although the likelihood in (9-25) is
appropriate for $S$, not $R$, surprisingly, this practice is equivalent to obtaining the
maximum likelihood estimates $\hat L$ and $\hat\Psi$ based on the sample covariance matrix $S$, setting
$\hat L_z = \hat V^{-1/2}\hat L$ and $\hat\Psi_z = \hat V^{-1/2}\hat\Psi\hat V^{-1/2}$. Here $\hat V^{-1/2}$ is the diagonal matrix with the
reciprocal of the sample standard deviations (computed with the divisor $\sqrt{n}$) on the main
diagonal.

Going in the other direction, given the estimated loadings $\hat L_z$ and specific
variances $\hat\Psi_z$ obtained from $R$, we find that the resulting maximum likelihood
estimates for a factor analysis of the covariance matrix $[(n - 1)/n]S$ are

$$\hat L = \hat V^{1/2}\hat L_z \quad\text{and}\quad \hat\Psi = \hat V^{1/2}\hat\Psi_z\hat V^{1/2}$$

or

$$\hat\ell_{ij} = \hat\ell_{z,ij}\sqrt{\hat\sigma_{ii}} \quad\text{and}\quad \hat\psi_i = \hat\psi_{z,i}\,\hat\sigma_{ii}$$

where $\hat\sigma_{ii}$ is the sample variance computed with divisor $n$. The distinction between
divisors can be ignored with principal component solutions. ■
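
The rescaling described in the Comment amounts to multiplying each row of the loadings by a standard deviation. The sketch below assumes NumPy; the function name `to_covariance_scale` and the numerical inputs are hypothetical values made up for illustration, not estimates from any data set in the text.

```python
import numpy as np

def to_covariance_scale(L_z, psi_z, var_n):
    """Convert estimates obtained from R to the covariance scale:
    l_ij = l_z,ij * sqrt(sigma_ii) and psi_i = psi_z,i * sigma_ii,
    where var_n holds the sample variances computed with divisor n."""
    sd = np.sqrt(var_n)
    L = L_z * sd[:, None]
    psi = psi_z * var_n
    return L, psi

# Hypothetical correlation-scale estimates and variances (illustrative only)
L_z = np.array([[0.8, 0.3], [0.7, -0.4], [0.6, 0.5]])
psi_z = 1.0 - (L_z ** 2).sum(axis=1)
var_n = np.array([4.0, 9.0, 0.25])

L_cov, psi_cov = to_covariance_scale(L_z, psi_z, var_n)

# Check: LL' + Psi equals V^{1/2} (L_z L_z' + Psi_z) V^{1/2}
V_half = np.diag(np.sqrt(var_n))
lhs = L_cov @ L_cov.T + np.diag(psi_cov)
rhs = V_half @ (L_z @ L_z.T + np.diag(psi_z)) @ V_half
print(np.allclose(lhs, rhs))
```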


The equivalency between factoring $S$ and $R$ has apparently been confused in
many published discussions of factor analysis. (See Supplement 9A.)


Example 9.5 (Factor analysis of stock-price data using the maximum likelihood
method) The stock-price data of Examples 8.5 and 9.4 were reanalyzed assuming
an $m = 2$ factor model and using the maximum likelihood method. The estimated
factor loadings, communalities, specific variances, and proportion of total
(standardized) sample variance explained by each factor are in Table 9.3.³ The
corresponding figures for the $m = 2$ factor solution obtained by the principal component
method (see Example 9.4) are also provided. The communalities corresponding to
the maximum likelihood factoring of $R$ are of the form [see (9-31)] $\hat h_i^2 = \hat\ell_{i1}^2 + \hat\ell_{i2}^2$.
So, for example,

$$\hat h_1^2 = (.115)^2 + (.755)^2 = .58$$


3 The maximum likelihood solution leads to a Heywood case. For this example, the solution of the
likelihood equations gives estimated loadings such that a specific variance is negative. The software
program obtains a feasible solution by slightly adjusting the loadings so that all specific variance estimates
are nonnegative. A Heywood case is suggested here by the .00 value for the specific variance of Royal
Dutch Shell.




Table 9.3

                           Maximum likelihood                     Principal components
                           Estimated factor     Specific          Estimated factor     Specific
                           loadings             variances         loadings             variances
Variable                   $F_1$        $F_2$       $\hat\psi_i = 1 - \hat h_i^2$     $F_1$       $F_2$       $\tilde\psi_i = 1 - \tilde h_i^2$

1. J P Morgan              .115       .755      .42               .732      -.437      .27
2. Citibank                .322       .788      .27               .831      -.280      .23
3. Wells Fargo             .182       .652      .54               .726      -.374      .33
4. Royal Dutch Shell      1.000      -.000      .00               .605       .694      .15
5. Texaco                  .683      -.032      .53               .563       .719      .17

Cumulative proportion
of total (standardized)
sample variance explained  .323       .647                        .487       .769


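The residual matrix is

$$R - \hat L\hat L' - \hat\Psi = \begin{bmatrix} 0 & .001 & -.002 & .000 & .052 \\ .001 & 0 & .002 & .000 & -.033 \\ -.002 & .002 & 0 & .000 & .001 \\ .000 & .000 & .000 & 0 & .000 \\ .052 & -.033 & .001 & .000 & 0 \end{bmatrix}$$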


The elements of R — LL’ — ¥ are much smaller than those of the residual matrix 
corresponding to the principal component factoring of R presented in Example 9.4. 
On this basis, we prefer the maximum likelihood approach and typically feature it in 
subsequent examples. 

The cumulative proportion of the total sample variance explained by the factors 
is larger for principal component factoring than for maximum likelihood factoring. 
It is not surprising that this criterion typically favors principal component factoring. 
Loadings obtained by a principal component factor analysis are related to the prin- 
cipal components, which have, by design, a variance optimizing property. [See the 
discussion preceding (8-19).] 

Focusing attention on the maximum likelihood solution, we see that all
variables have positive loadings on $F_1$. We call this factor the market factor, as we did in
the principal component solution. The interpretation of the second factor is not as
clear as it appeared to be in the principal component solution. The bank stocks have
large positive loadings and the oil stocks have negligible loadings on the second
factor $F_2$. From this perspective, the second factor differentiates the bank stocks from
the oil stocks and might be called an industry factor. Alternatively, the second factor
might be simply called a banking factor.




The patterns of the initial factor loadings for the maximum likelihood solution
are constrained by the uniqueness condition that $\hat L'\hat\Psi^{-1}\hat L$ be a diagonal matrix.
Therefore, useful factor patterns are often not revealed until the factors are rotated
(see Section 9.4). ■


Example 9.6 (Factor analysis of Olympic decathlon data) Linden [11] originally
conducted a factor analytic study of Olympic decathlon results for all 160 complete
starts from the end of World War II until the mid-seventies. Following his approach
we examine the $n = 280$ complete starts from 1960 through 2004. The recorded
values for each event were standardized and the signs of the timed events changed
so that large scores are good for all events. We, too, analyze the correlation matrix,
which based on all 280 cases, is

$$R = \begin{bmatrix}
1.0000 & .6386 & .4752 & .3227 & .5520 & .3262 & .3509 & .4008 & .1821 & -.0352 \\
.6386 & 1.0000 & .4953 & .5668 & .4706 & .3520 & .3998 & .5167 & .3102 & .1012 \\
.4752 & .4953 & 1.0000 & .4357 & .2539 & .2812 & .7926 & .4728 & .4682 & -.0120 \\
.3227 & .5668 & .4357 & 1.0000 & .3449 & .3503 & .3657 & .6040 & .2344 & .2380 \\
.5520 & .4706 & .2539 & .3449 & 1.0000 & .1546 & .2100 & .4213 & .2116 & .4125 \\
.3262 & .3520 & .2812 & .3503 & .1546 & 1.0000 & .2553 & .4163 & .1712 & .0002 \\
.3509 & .3998 & .7926 & .3657 & .2100 & .2553 & 1.0000 & .4036 & .4179 & .0109 \\
.4008 & .5167 & .4728 & .6040 & .4213 & .4163 & .4036 & 1.0000 & .3151 & .2395 \\
.1821 & .3102 & .4682 & .2344 & .2116 & .1712 & .4179 & .3151 & 1.0000 & .0983 \\
-.0352 & .1012 & -.0120 & .2380 & .4125 & .0002 & .0109 & .2395 & .0983 & 1.0000
\end{bmatrix}$$


From a principal component factor analysis perspective, the first four eigen-
values, 4.21, 1.39, 1.06, .92, of R suggest a factor solution with $m = 3$ or $m = 4$. A
subsequent interpretation, much like Linden's original analysis, reinforces the
choice $m = 4$.

In this case, the two solution methods produced very different results. For the prin-
cipal component factorization, all events except the 1,500-meter run have large positive
loadings on the first factor. This factor might be labeled general athletic ability. Factor 2,
which loads heavily on the 400-meter run and 1,500-meter run, might be called a run-
ning endurance factor. The remaining factors cannot be easily interpreted to our minds.

For the maximum likelihood method, the first factor appears to be a general ath-
letic ability factor, but the loading pattern is not as strong as with the principal compo-
nent factor solution. The second factor is primarily a strength factor because shot put
and discus load highly on this factor. The third factor is running endurance since the 
400-meter run and 1,500-meter run have large loadings. Again, the fourth factor is 
not easily identified, although it may have something to do with jumping ability or 
leg strength. We shall return to an interpretation of the factors in Example 9.11 after 
a discussion of factor rotation. 

Table 9.4

                            Principal component                      Maximum likelihood
                     Estimated factor loadings   Specific     Estimated factor loadings   Specific
                                                 variances                                variances
Variable              F1     F2     F3     F4    1 - h_i^2     F1     F2     F3     F4    1 - h_i^2

 1. 100-m run        .696   .022  -.468  -.416     .12        .993  -.069  -.021   .002     .01
 2. Long jump        .793   .075  -.255  -.115     .29        .665   .252   .239   .220     .39
 3. Shot put         .771  -.434   .197  -.112     .17        .530   .777  -.141  -.079     .09
 4. High jump        .711   .181   .005   .367     .33        .342   .428   .421   .424     .33
 5. 400-m run        .605   .549  -.045  -.397     .17        .571   .019   .620  -.305     .20
 6. 110-m hurdles    .513  -.083  -.372   .561     .28        .343   .189   .090   .323     .73
 7. Discus           .690  -.456   .289  -.078     .23        .402   .718  -.102  -.095     .30
 8. Pole vault       .761   .162   .018   .304     .30        .440   .407   .390   .263     .42
 9. Javelin          .518  -.252   .519  -.074     .39        .218   .461   .084  -.085     .73
10. 1500-m run       .220   .746   .493   .085     .15       -.016   .091   .609  -.145     .60

Cumulative
proportion of
total variance
explained            .42    .56    .67    .76                 .27    .45    .57    .62

The four-factor principal component solution accounts for much of the total
(standardized) sample variance, although the estimated specific variances are
large in some cases (for example, the javelin). This suggests that some events
might require unique or specific attributes not required for the other events. The
four-factor maximum likelihood solution accounts for less of the total sample
variance, but, as the following residual matrices indicate, the maximum likelihood
estimates $\hat{L}$ and $\hat{\Psi}$ do a better job of reproducing R than the principal component
estimates $\tilde{L}$ and $\tilde{\Psi}$.


Principal component:

$R - \tilde{L}\tilde{L}' - \tilde{\Psi} =$

       0   -.082  -.006  -.021  -.068   .031  -.016   .003   .039   .062
    -.082     0   -.046   .033  -.107  -.078  -.048  -.059   .042   .006
    -.006  -.046     0    .006  -.010  -.014  -.003  -.013  -.151   .055
    -.021   .033   .006     0   -.038  -.204  -.015  -.078  -.064  -.086
    -.068  -.107  -.010  -.038     0    .096   .025  -.006   .030  -.074
     .031  -.078  -.014  -.204   .096     0    .015  -.124   .119   .085
    -.016  -.048  -.003  -.015   .025   .015     0   -.029  -.210   .064
     .003  -.059  -.013  -.078  -.006  -.124  -.029     0   -.026  -.084
     .039   .042  -.151  -.064   .030   .119  -.210  -.026     0   -.078
     .062   .006   .055  -.086  -.074   .085   .064  -.084  -.078     0


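Maximum likelihood:

$R - \hat{L}\hat{L}' - \hat{\Psi} =$

       0    .000   .000  -.000  -.000   .000  -.000   .000  -.001   .000
     .000     0   -.002   .023   .005  -.017  -.003  -.030   .047  -.024
     .000  -.002     0    .004  -.001  -.009   .000  -.001  -.001   .000
    -.000   .023   .004     0   -.002  -.030  -.004  -.006  -.042   .010
    -.000   .005  -.001  -.002     0   -.002   .001   .001   .001  -.001
     .000  -.017  -.009  -.030  -.002     0    .022   .069   .029  -.019
    -.000  -.003   .000  -.004   .001   .022     0   -.000  -.000   .000
     .000  -.030  -.001  -.006   .001   .069  -.000     0    .021   .011
    -.001   .047  -.001  -.042   .001   .029  -.000   .021     0   -.003
     .000  -.024   .000   .010  -.001  -.019   .000   .011  -.003     0
■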
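As a small worked check of how a residual entry arises, the sketch below recomputes the (100-m run, long jump) entry of both residual matrices. It assumes the four-factor loadings for these two events as read from Table 9.4 and their sample correlation .6386 from R; the helper names are illustrative only. Because $\Psi$ is diagonal, it does not enter the off-diagonal residuals.

```python
# Sample correlation between the 100-m run and the long jump (from R).
r_12 = 0.6386

# Four-factor loadings for the two events, as read from Table 9.4.
pc_100m = [0.696, 0.022, -0.468, -0.416]
pc_long_jump = [0.793, 0.075, -0.255, -0.115]
ml_100m = [0.993, -0.069, -0.021, 0.002]
ml_long_jump = [0.665, 0.252, 0.239, 0.220]


def fitted_correlation(loadings_i, loadings_k):
    """Off-diagonal entry (i, k) of LL': inner product of two loading rows."""
    return sum(a * b for a, b in zip(loadings_i, loadings_k))


def residual(r_ik, loadings_i, loadings_k):
    """Entry (i, k), i != k, of R - LL' - Psi (the diagonal Psi drops out)."""
    return r_ik - fitted_correlation(loadings_i, loadings_k)


pc_residual = residual(r_12, pc_100m, pc_long_jump)
ml_residual = residual(r_12, ml_100m, ml_long_jump)
print(f"principal component residual: {pc_residual:.3f}")
print(f"maximum likelihood residual:  {ml_residual:.3f}")
```

With these inputs the principal component residual is close to the tabled value of -.082, while the maximum likelihood residual is essentially zero, in line with the comparison made in the text.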


A Large Sample Test for the Number of Common Factors 


The assumption of a normal population leads directly to a test of the adequacy of
the model. Suppose the $m$ common factor model holds. In this case $\Sigma = LL' + \Psi$,
and testing the adequacy of the $m$ common factor model is equivalent to testing

$$H_0\colon \underset{(p \times p)}{\Sigma} = \underset{(p \times m)}{L}\;\underset{(m \times p)}{L'} + \underset{(p \times p)}{\Psi} \qquad \text{(9-33)}$$

versus $H_1$: $\Sigma$ any other positive definite matrix. When $\Sigma$ does not have any special
form, the maximum of the likelihood function [see (4-18) and Result 4.11 with $\hat{\Sigma} =
((n - 1)/n)S = S_n$] is proportional to

$$|S_n|^{-n/2} e^{-np/2} \qquad \text{(9-34)}$$


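Under $H_0$, $\Sigma$ is restricted to have the form of (9-33). In this case, the maximum of
the likelihood function [see (9-25) with $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = \hat{L}\hat{L}' + \hat{\Psi}$, where $\hat{L}$ and
$\hat{\Psi}$ are the maximum likelihood estimates of $L$ and $\Psi$, respectively] is proportional to

$$\begin{aligned}
|\hat{\Sigma}|^{-n/2} &\exp\left(-\tfrac{1}{2}\,\mathrm{tr}\left[\hat{\Sigma}^{-1}\left(\sum_{j=1}^{n}(x_j - \bar{x})(x_j - \bar{x})'\right)\right]\right) \\
&= |\hat{L}\hat{L}' + \hat{\Psi}|^{-n/2} \exp\left(-\tfrac{1}{2}\,n\,\mathrm{tr}\left[(\hat{L}\hat{L}' + \hat{\Psi})^{-1}S_n\right]\right)
\end{aligned} \qquad \text{(9-35)}$$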


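Using Result 5.2, (9-34), and (9-35), we find that the likelihood ratio statistic for
testing $H_0$ is

$$\begin{aligned}
-2\ln\Lambda &= -2\ln\left[\frac{\text{maximized likelihood under } H_0}{\text{maximized likelihood}}\right] \\
&= -2\ln\left(\frac{|\hat{\Sigma}|}{|S_n|}\right)^{-n/2} + n\left[\mathrm{tr}(\hat{\Sigma}^{-1}S_n) - p\right]
\end{aligned} \qquad \text{(9-36)}$$

with degrees of freedom

$$\nu - \nu_0 = \tfrac{1}{2}p(p + 1) - \left[p(m + 1) - \tfrac{1}{2}m(m - 1)\right] = \tfrac{1}{2}\left[(p - m)^2 - p - m\right] \qquad \text{(9-37)}$$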


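Supplement 9A indicates that $\mathrm{tr}(\hat{\Sigma}^{-1}S_n) - p = 0$ provided that $\hat{\Sigma} = \hat{L}\hat{L}' + \hat{\Psi}$ is
the maximum likelihood estimate of $\Sigma = LL' + \Psi$. Thus, we have

$$-2\ln\Lambda = n\ln\left(\frac{|\hat{\Sigma}|}{|S_n|}\right) \qquad \text{(9-38)}$$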


Bartlett [3] has shown that the chi-square approximation to the sampling distri-
bution of $-2\ln\Lambda$ can be improved by replacing $n$ in (9-38) with the multiplicative
factor $(n - 1 - (2p + 4m + 5)/6)$.

Using Bartlett's correction,⁴ we reject $H_0$ at the $\alpha$ level of significance if

$$(n - 1 - (2p + 4m + 5)/6)\,\ln\frac{|\hat{L}\hat{L}' + \hat{\Psi}|}{|S_n|} > \chi^2_{[(p - m)^2 - p - m]/2}(\alpha) \qquad \text{(9-39)}$$

provided that $n$ and $n - p$ are large. Since the number of degrees of freedom,
$\tfrac{1}{2}[(p - m)^2 - p - m]$, must be positive, it follows that

$$m < \tfrac{1}{2}\left(2p + 1 - \sqrt{8p + 1}\right) \qquad \text{(9-40)}$$

in order to apply the test (9-39).
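The degrees-of-freedom formula in (9-37) and the bound in (9-40) are easy to tabulate. The following is a minimal sketch; the function names are illustrative and not part of the text.

```python
import math


def degrees_of_freedom(p: int, m: int) -> float:
    """Degrees of freedom in (9-37): [(p - m)^2 - p - m] / 2."""
    return ((p - m) ** 2 - p - m) / 2


def max_testable_factors(p: int) -> int:
    """Largest integer m satisfying the strict inequality (9-40)."""
    bound = (2 * p + 1 - math.sqrt(8 * p + 1)) / 2
    m = math.ceil(bound) - 1  # strict inequality: m must lie below the bound
    return max(m, 0)


for p in (5, 6, 10):
    m_max = max_testable_factors(p)
    print(p, m_max, degrees_of_freedom(p, m_max))
```

For $p = 5$ variables, for instance, the bound allows at most $m = 2$ factors, leaving one degree of freedom for the test.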


⁴ Many factor analysts obtain an approximate maximum likelihood estimate by replacing $S_n$ with
the unbiased estimate $S = [n/(n - 1)]S_n$ and then minimizing $\ln|\Sigma| + \mathrm{tr}[\Sigma^{-1}S]$. The dual substitution
of $S$ and the approximate maximum likelihood estimator into the test statistic of (9-39) does not affect its
large sample properties.




Comment. In implementing the test in (9-39), we are testing for the adequacy
of the $m$ common factor model by comparing the generalized variances $|\hat{L}\hat{L}' + \hat{\Psi}|$
and $|S_n|$. If $n$ is large and $m$ is small relative to $p$, the hypothesis $H_0$ will usually be
rejected, leading to a retention of more common factors. However, $\hat{\Sigma} = \hat{L}\hat{L}' + \hat{\Psi}$
may be close enough to $S_n$ so that adding more factors does not provide additional
insights, even though those factors are "significant." Some judgment must be exer-
cised in the choice of $m$.


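Example 9.7 (Testing for two common factors) The two-factor maximum likelihood
analysis of the stock-price data was presented in Example 9.5. The residual
matrix there suggests that a two-factor solution may be adequate. Test the hypothesis
$H_0\colon \Sigma = LL' + \Psi$, with $m = 2$, at level $\alpha = .05$.

The test statistic in (9-39) is based on the ratio of generalized variances

$$\frac{|\hat{\Sigma}|}{|S_n|} = \frac{|\hat{L}\hat{L}' + \hat{\Psi}|}{|S_n|}$$

Let $\hat{V}^{-1/2}$ be the diagonal matrix such that $\hat{V}^{-1/2}S_n\hat{V}^{-1/2} = R$. By the properties of
determinants (see Result 2A.11),

$$|\hat{V}^{-1/2}|\,|\hat{L}\hat{L}' + \hat{\Psi}|\,|\hat{V}^{-1/2}| = |\hat{V}^{-1/2}\hat{L}\hat{L}'\hat{V}^{-1/2} + \hat{V}^{-1/2}\hat{\Psi}\hat{V}^{-1/2}|$$

and

$$|\hat{V}^{-1/2}|\,|S_n|\,|\hat{V}^{-1/2}| = |\hat{V}^{-1/2}S_n\hat{V}^{-1/2}|$$

Consequently,

$$\begin{aligned}
\frac{|\hat{\Sigma}|}{|S_n|} &= \frac{|\hat{V}^{-1/2}|}{|\hat{V}^{-1/2}|}\,\frac{|\hat{L}\hat{L}' + \hat{\Psi}|}{|S_n|}\,\frac{|\hat{V}^{-1/2}|}{|\hat{V}^{-1/2}|} \\
&= \frac{|\hat{V}^{-1/2}\hat{L}\hat{L}'\hat{V}^{-1/2} + \hat{V}^{-1/2}\hat{\Psi}\hat{V}^{-1/2}|}{|\hat{V}^{-1/2}S_n\hat{V}^{-1/2}|} \\
&= \frac{|\hat{L}_z\hat{L}_z' + \hat{\Psi}_z|}{|R|}
\end{aligned} \qquad \text{(9-41)}$$

by (9-30). From Example 9.5, we determine

$$\frac{|\hat{L}_z\hat{L}_z' + \hat{\Psi}_z|}{|R|} =
\frac{\begin{vmatrix}
1.000 & & & & \\
.632 & 1.000 & & & \\
.513 & .572 & 1.000 & & \\
.115 & .322 & .182 & 1.000 & \\
.103 & .246 & .146 & .683 & 1.000
\end{vmatrix}}
{\begin{vmatrix}
1.000 & & & & \\
.632 & 1.000 & & & \\
.510 & .574 & 1.000 & & \\
.115 & .322 & .182 & 1.000 & \\
.154 & .213 & .146 & .683 & 1.000
\end{vmatrix}}
= \frac{.17898}{.17519} = 1.0216$$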


Using Bartlett's correction, we evaluate the test statistic in (9-39):

$$\left[n - 1 - (2p + 4m + 5)/6\right]\ln\frac{|\hat{L}\hat{L}' + \hat{\Psi}|}{|S_n|}
= \left[103 - 1 - \frac{(10 + 8 + 5)}{6}\right]\ln(1.0216) = 2.10$$

Since $\tfrac{1}{2}[(p - m)^2 - p - m] = \tfrac{1}{2}[(5 - 2)^2 - 5 - 2] = 1$, the 5% critical value
$\chi^2_1(.05) = 3.84$ is not exceeded, and we fail to reject $H_0$. We conclude that the data do
not contradict a two-factor model. In fact, the observed significance level, or P-value,
$P[\chi^2_1 > 2.10] = .15$ implies that $H_0$ would not be rejected at any reasonable level. ■
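The arithmetic of Example 9.7 can be retraced in a few lines. This sketch takes the two determinants (.17898 and .17519) as given in the example rather than recomputing them; the function names are illustrative, and the chi-square tail probability is written only for one degree of freedom, where it reduces to a standard normal tail.

```python
import math


def bartlett_statistic(n: int, p: int, m: int, det_ratio: float) -> float:
    """Left-hand side of (9-39); det_ratio is |LL' + Psi| / |S_n|."""
    factor = n - 1 - (2 * p + 4 * m + 5) / 6
    return factor * math.log(det_ratio)


def chi2_upper_tail_1df(x: float) -> float:
    """P[chi-square with 1 d.f. > x], via the standard normal tail."""
    return math.erfc(math.sqrt(x / 2))


n, p, m = 103, 5, 2
det_ratio = 0.17898 / 0.17519  # ratio of determinants from Example 9.7
stat = bartlett_statistic(n, p, m, det_ratio)
p_value = chi2_upper_tail_1df(stat)
print(round(stat, 2), round(p_value, 2))
```

The statistic stays well below the 5% critical value of 3.84, which matches the conclusion reached in the example.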


Large sample variances and covariances for the maximum likelihood estimates
$\hat{\ell}_{ij}$, $\hat{\psi}_i$ have been derived when these estimates have been determined from the sample
covariance matrix $S$. (See [10].) The expressions are, in general, quite complicated.


9.4 Factor Rotation 


As we indicated in Section 9.2, all factor loadings obtained from the initial loadings 
by an orthogonal transformation have the same ability to reproduce the covariance 
(or correlation) matrix. [See (9-8).] From matrix algebra, we know that an orthogo- 
nal transformation corresponds to a rigid rotation (or reflection) of the coordinate 
axes. For this reason, an orthogonal transformation of the factor loadings, as well as 
the implied orthogonal transformation of the factors, is called factor rotation. 

If $\hat{L}$ is the $p \times m$ matrix of estimated factor loadings obtained by any method
(principal component, maximum likelihood, and so forth) then

$$\hat{L}^* = \hat{L}T, \quad \text{where } TT' = T'T = I \qquad \text{(9-42)}$$

is a $p \times m$ matrix of "rotated" loadings. Moreover, the estimated covariance (or
correlation) matrix remains unchanged, since

$$\hat{L}\hat{L}' + \hat{\Psi} = \hat{L}TT'\hat{L}' + \hat{\Psi} = \hat{L}^*\hat{L}^{*\prime} + \hat{\Psi} \qquad \text{(9-43)}$$

Equation (9-43) indicates that the residual matrix, $S_n - \hat{L}\hat{L}' - \hat{\Psi} = S_n - \hat{L}^*\hat{L}^{*\prime} - \hat{\Psi}$,
remains unchanged. Moreover, the specific variances $\hat{\psi}_i$, and hence the communalities
$\hat{h}_i^2$, are unaltered. Thus, from a mathematical viewpoint, it is immaterial whether $\hat{L}$
or $\hat{L}^*$ is obtained.

Since the original loadings may not be readily interpretable, it is usual practice 
to rotate them until a “simpler structure” is achieved. The rationale is very much 
akin to sharpening the focus of a microscope in order to see the detail more clearly. 

Ideally, we should like to see a pattern of loadings such that each variable loads 
highly on a single factor and has small to moderate loadings on the remaining factors. 
However, it is not always possible to get this simple structure, although the rotated load- 
ings for the decathlon data discussed in Example 9.11 provide a nearly ideal pattern. 

We shall concentrate on graphical and analytical methods for determining an 
orthogonal rotation to a simple structure. When m = 2, or the common factors are 
considered two at a time, the transformation to a simple structure can frequently be 
determined graphically. The uncorrelated common factors are regarded as unit
vectors along perpendicular coordinate axes. A plot of the pairs of factor loadings
$(\hat{\ell}_{i1}, \hat{\ell}_{i2})$ yields $p$ points, each point corresponding to a variable. The coordinate axes
can then be visually rotated through an angle, call it $\phi$, and the new rotated load-
ings $\hat{\ell}^*_{ij}$ are determined from the relationships

$$\underset{(p \times 2)}{\hat{L}^*} = \underset{(p \times 2)}{\hat{L}}\;\underset{(2 \times 2)}{T} \qquad \text{(9-44)}$$

where

$$T = \begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix} \quad \text{clockwise rotation}$$

$$T = \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix} \quad \text{counterclockwise rotation}$$


The relationship in (9-44) is rarely implemented in a two-dimensional graphical 
analysis. In this situation, clusters of variables are often apparent by eye, and these 
clusters enable one to identify the common factors without having to inspect the mag- 
nitudes of the rotated loadings. On the other hand, for m > 2, orientations are not 
easily visualized, and the magnitudes of the rotated loadings must be inspected to find 
a meaningful interpretation of the original data. The choice of an orthogonal matrix T 
that satisfies an analytical measure of simple structure will be considered shortly. 


Example 9.8 (A first look at factor rotation) Lawley and Maxwell [10] present the
sample correlation matrix of examination scores in $p = 6$ subject areas for
$n = 220$ male students. The correlation matrix is

          Gaelic  English  History  Arithmetic  Algebra  Geometry

R =        1.0     .439     .410      .288       .329     .248
                   1.0      .351      .354       .320     .329
                            1.0       .164       .190     .181
                                      1.0        .595     .470
                                                 1.0      .464
                                                          1.0

and a maximum likelihood solution for $m = 2$ common factors yields the estimates
in Table 9.5.

Table 9.5

                    Estimated
                    factor loadings      Communalities
Variable            F1         F2        $\hat{h}_i^2$

1. Gaelic           .553       .429      .490
2. English          .568       .288      .406
3. History          .392       .450      .356
4. Arithmetic       .740      -.273      .623
5. Algebra          .724      -.211      .569
6. Geometry         .595      -.132      .372

All the variables have positive loadings on the first factor. Lawley and
Maxwell suggest that this factor reflects the overall response of the students to in-
struction and might be labeled a general intelligence factor. Half the loadings are
positive and half are negative on the second factor. A factor with this pattern of
loadings is called a bipolar factor. (The assignment of negative and positive poles
is arbitrary, because the signs of the loadings on a factor can be reversed without
affecting the analysis.) This factor is not easily identified, but is such that individu-
als who get above-average scores on the verbal tests get above-average scores on
the factor. Individuals with above-average scores on the mathematical tests get
below-average scores on the factor. Perhaps this factor can be classified as a
"math-nonmath" factor.

The factor loading pairs $(\hat{\ell}_{i1}, \hat{\ell}_{i2})$ are plotted as points in Figure 9.1. The points
are labeled with the numbers of the corresponding variables. Also shown is a clock-
wise orthogonal rotation of the coordinate axes through an angle of $\phi \doteq 20^{\circ}$. This
angle was chosen so that one of the new axes passes through $(\hat{\ell}_{41}, \hat{\ell}_{42})$. When this is
done, all the points fall in the first quadrant (the factor loadings are all positive), and
the two distinct clusters of variables are more clearly revealed.

The mathematical test variables load highly on $F_1^*$ and have negligible load-
ings on $F_2^*$. The first factor might be called a mathematical-ability factor. Similarly,
the three verbal test variables have high loadings on $F_2^*$ and moderate to small
loadings on $F_1^*$. The second factor might be labeled a verbal-ability factor.
The general-intelligence factor identified initially is submerged in the factors $F_1^*$
and $F_2^*$.

The rotated factor loadings obtained from (9-44) with $\phi \doteq 20^{\circ}$ and the
corresponding communality estimates are shown in Table 9.6. The magnitudes of
the rotated factor loadings reinforce the interpretation of the factors suggested by
Figure 9.1.

The communality estimates are unchanged by the orthogonal rotation, since
$\hat{L}\hat{L}' = \hat{L}TT'\hat{L}' = \hat{L}^*\hat{L}^{*\prime}$, and the communalities are the diagonal elements of these
matrices.
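The rotation in (9-44) can be carried out directly on the loadings of Table 9.5. The sketch below applies the clockwise matrix $T$ with $\phi = 20^{\circ}$ exactly and compares the communalities before and after rotation. One input is an assumption: the second Arithmetic loading is taken as -.273, which is consistent with that variable's communality of .623 and with the negative loadings of the other mathematical tests. Because the text gives the angle only approximately, the printed values are illustrative and may differ in the third decimal from published tables.

```python
import math

# Unrotated maximum likelihood loadings (F1, F2) from Table 9.5.
# The second Arithmetic loading is assumed to be -0.273 (see the lead-in).
loadings = {
    "Gaelic": (0.553, 0.429),
    "English": (0.568, 0.288),
    "History": (0.392, 0.450),
    "Arithmetic": (0.740, -0.273),
    "Algebra": (0.724, -0.211),
    "Geometry": (0.595, -0.132),
}


def rotate_clockwise(row, phi_degrees):
    """Apply L* = L T with the clockwise T of (9-44) to one row of L."""
    phi = math.radians(phi_degrees)
    c, s = math.cos(phi), math.sin(phi)
    l1, l2 = row
    # T = [[c, s], [-s, c]], so (l1, l2) T = (l1*c - l2*s, l1*s + l2*c)
    return (l1 * c - l2 * s, l1 * s + l2 * c)


rotated = {name: rotate_clockwise(row, 20.0) for name, row in loadings.items()}

for name, (f1, f2) in rotated.items():
    h2_before = sum(x * x for x in loadings[name])
    h2_after = f1 * f1 + f2 * f2
    print(f"{name:<11} {f1:6.3f} {f2:6.3f}   h2: {h2_before:.3f} -> {h2_after:.3f}")
```

After rotation the three mathematical tests load mainly on the first rotated factor and the three verbal tests mainly on the second, while each communality is the same before and after, as (9-43) requires.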

We point out that Figure 9.1 suggests an oblique rotation of the coordinates.
One new axis would pass through the cluster {1,2,3} and the other through the
{4,5,6} group. Oblique rotations are so named because they correspond to a
nonrigid rotation of coordinate axes leading to new axes that are not perpendicular.


Figure 9.1 Factor rotation for test 
scores. 


Table 9.6

                    Estimated rotated
                    factor loadings      Communalities
Variable            F1*        F2*       $\hat{h}_i^{*2} = \hat{h}_i^2$

1. Gaelic                                .490
2. English                               .406
3. History                               .356
4. Arithmetic                            .623
5. Algebra                               .568
6. Geometry                              .372


It is apparent, however, that the interpretation of the oblique factors for this
example would be much the same as that given previously for an orthogonal
rotation. ■


Kaiser [9] has suggested an analytical measure of simple structure known as the
varimax (or normal varimax) criterion. Define $\tilde{\ell}^*_{ij} = \hat{\ell}^*_{ij}/\hat{h}_i$ to be the rotated coeffi-
cients scaled by the square root of the communalities. Then the (normal) varimax
procedure selects the orthogonal transformation $T$ that makes

$$V = \frac{1}{p}\sum_{j=1}^{m}\left[\sum_{i=1}^{p}\tilde{\ell}^{*4}_{ij} - \frac{\left(\sum_{i=1}^{p}\tilde{\ell}^{*2}_{ij}\right)^2}{p}\right] \qquad \text{(9-45)}$$

as large as possible.

Scaling the rotated coefficients $\hat{\ell}^*_{ij}$ has the effect of giving variables with small
communalities relatively more weight in the determination of simple structure.
After the transformation $T$ is determined, the loadings $\tilde{\ell}^*_{ij}$ are multiplied by $\hat{h}_i$ so
that the original communalities are preserved.

Although (9-45) looks rather forbidding, it has a simple interpretation. In
words,

$$V \propto \sum_{j=1}^{m}\left(\text{variance of squares of (scaled) loadings for } j\text{th factor}\right) \qquad \text{(9-46)}$$


Effectively, maximizing V corresponds to “spreading out” the squares of the load- 
ings on each factor as much as possible. Therefore, we hope to find groups of large 
and negligible coefficients in any column of the rotated loadings matrix L*. 
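To make (9-45) concrete, the sketch below evaluates $V$ for the two-factor loadings of Example 9.8 before rotation, after the 20-degree clockwise rotation used there, and over a coarse grid of angles. It is only an illustration of the criterion: the grid search stands in for the iterative algorithms that statistical packages use, the function names are invented for this example, and the Arithmetic loading of -.273 on the second factor is an assumption consistent with that variable's communality.

```python
import math

# Unrotated loadings (F1, F2) for the six test scores of Example 9.8;
# the Arithmetic F2 value of -0.273 is an assumption (see the lead-in).
L = [
    (0.553, 0.429), (0.568, 0.288), (0.392, 0.450),
    (0.740, -0.273), (0.724, -0.211), (0.595, -0.132),
]


def rotate(rows, phi_degrees):
    """Clockwise rotation L* = L T for two factors, T as in (9-44)."""
    phi = math.radians(phi_degrees)
    c, s = math.cos(phi), math.sin(phi)
    return [(l1 * c - l2 * s, l1 * s + l2 * c) for l1, l2 in rows]


def varimax_criterion(rows):
    """V of (9-45): loadings are first scaled by the root communalities."""
    p = len(rows)
    m = len(rows[0])
    scaled_sq = []
    for row in rows:
        h2 = sum(x * x for x in row)                 # communality of this variable
        scaled_sq.append([x * x / h2 for x in row])  # squared scaled loadings
    total = 0.0
    for j in range(m):
        col = [scaled_sq[i][j] for i in range(p)]
        total += sum(v * v for v in col) - sum(col) ** 2 / p
    return total / p


v_unrotated = varimax_criterion(L)
v_20 = varimax_criterion(rotate(L, 20.0))

# Crude grid search over the rotation angle, in steps of 0.1 degree.
best_angle = max((k / 10 for k in range(0, 900)),
                 key=lambda a: varimax_criterion(rotate(L, a)))
v_best = varimax_criterion(rotate(L, best_angle))
print(f"V unrotated = {v_unrotated:.4f}, V at 20 degrees = {v_20:.4f}")
print(f"grid maximum near {best_angle:.1f} degrees, V = {v_best:.4f}")
```

The criterion is markedly larger after the 20-degree rotation than before it, which is the numerical counterpart of the clusters separating along the rotated axes.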

Computing algorithms exist for maximizing V, and most popular factor analysis 
computer programs (for example, the statistical software packages SAS, SPSS, 
BMDP, and MINITAB) provide varimax rotations. As might be expected, varimax 
rotations of factor loadings obtained by different solution methods (principal com- 
ponents, maximum likelihood, and so forth) will not, in general, coincide. Also, the 
pattern of rotated loadings may change considerably if additional common factors 
are included in the rotation. If a dominant single factor exists, it will generally be ob- 
scured by any orthogonal rotation. By contrast, it can always be held fixed and the 
remaining factors rotated. 




Example 9.9 (Rotated loadings for the consumer-preference data) Let us return to
the marketing data discussed in Example 9.3. The original factor loadings (obtained
by the principal component method), the communalities, and the (varimax) rotated
factor loadings are shown in Table 9.7. (See the SAS statistical software output in
Panel 9.1.)


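Table 9.7

                               Estimated           Rotated
                               factor              estimated factor
                               loadings            loadings             Communalities
Variable                       F1       F2         F1*      F2*         $\tilde{h}_i^2$

1. Taste                       .56      .82        .02      .99         .98
2. Good buy for money          .78     -.52        .94     -.01         .88
3. Flavor                      .65      .75        .13      .98         .98
4. Suitable for snack          .94     -.10        .84      .43         .89
5. Provides lots of energy     .80     -.54        .97     -.02         .93

Cumulative proportion
of total (standardized)
sample variance explained      .571     .932       .507     .932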


It is clear that variables 2, 4, and 5 define factor 1 (high loadings on factor 1, 
small or negligible loadings on factor 2), while variables 1 and 3 define factor 2 (high 
loadings on factor 2, small or negligible loadings on factor 1). Variable 4 is most 
closely aligned with factor 1, although it has aspects of the trait represented by 
factor 2. We might call factor 1 a nutritional factor and factor 2 a taste factor. 

The factor loadings for the variables are pictured with respect to the original
and (varimax) rotated factor axes in Figure 9.2. ■


Figure 9.2 Factor rotation for hypothetical marketing data.


PANEL 9.1 SAS ANALYSIS FOR EXAMPLE 9.9 USING PROC FACTOR.


PROGRAM COMMANDS

    title 'Factor Analysis';
    data consumer(type = corr);
    _type_='CORR';
    input _name_ $ taste money flavor snack energy;
    cards;
    taste    1.00    .      .      .      .
    money     .02   1.00    .      .      .
    flavor    .96    .13   1.00    .      .
    snack     .42    .71    .50   1.00    .
    energy    .01    .85    .11    .79   1.00
    ;
    proc factor res data=consumer
         method=prin nfact=2 rotate=varimax preplot plot;
    var taste money flavor snack energy;


OUTPUT

    Initial Factor Method: Principal Components

    Prior Communality Estimates: ONE

    Eigenvalues of the Correlation Matrix: Total = 5  Average = 1

                         1          2          3          4          5
    Eigenvalue       2.853090   1.806332   0.204490   0.102409   0.033677
    Difference       1.046758   1.601842   0.102081   0.068732
    Proportion       0.5706     0.3613     0.0409     0.0205     0.0067
    Cumulative       0.5706     0.9319     0.9728     0.9933     1.0000

    2 factors will be retained by the NFACTOR criterion.

    Factor Pattern

                 FACTOR1     FACTOR2
    TASTE        0.55986     0.81610
    MONEY        0.77726    -0.52420
    FLAVOR       0.64534     0.74795
    SNACK        0.93911    -0.10492
    ENERGY       0.79821    -0.54323

    Final Communality Estimates: Total = 4.659423

    TASTE       MONEY       FLAVOR      SNACK       ENERGY
    0.97961     0.878920

    Rotation Method: Varimax

    Rotated Factor Pattern

                 FACTOR1     FACTOR2
    TASTE        0.01970     0.98948
    MONEY        0.93744    -0.01123
    FLAVOR       0.12856     0.97947
    SNACK        0.84244     0.42805
    ENERGY       0.96539    -0.01563

    Variance explained by each factor

    FACTOR1      FACTOR2
    2.537396     2.122027
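The figures in Panel 9.1 can be cross-checked against one another. The sketch below recomputes the variance explained by each factor (column sums of squared loadings) and the communalities (row sums of squared loadings) from the two factor patterns. Only squared loadings are used, so the results do not depend on the signs of the individual entries; the variable names are illustrative.

```python
# Factor patterns from Panel 9.1 (rows: taste, money, flavor, snack, energy).
# Only squared loadings are used below, so the signs do not affect the results.
unrotated = [
    (0.55986, 0.81610), (0.77726, -0.52420), (0.64534, 0.74795),
    (0.93911, -0.10492), (0.79821, -0.54323),
]
rotated = [
    (0.01970, 0.98948), (0.93744, -0.01123), (0.12856, 0.97947),
    (0.84244, 0.42805), (0.96539, -0.01563),
]


def variance_explained(pattern):
    """Sum of squared loadings in each column (one value per factor)."""
    return [sum(row[j] ** 2 for row in pattern) for j in range(len(pattern[0]))]


def communalities(pattern):
    """Sum of squared loadings in each row (one value per variable)."""
    return [sum(x * x for x in row) for row in pattern]


var_unrotated = variance_explained(unrotated)
var_rotated = variance_explained(rotated)
h2_unrotated = communalities(unrotated)
h2_rotated = communalities(rotated)

print("unrotated:", [round(v, 4) for v in var_unrotated])
print("rotated:  ", [round(v, 4) for v in var_rotated])
print("total:    ", round(sum(var_unrotated), 4), round(sum(var_rotated), 4))
```

The unrotated column sums reproduce the first two eigenvalues, the rotated column sums reproduce the "Variance explained by each factor" line, and both totals agree with the sum of the communalities. Rotation redistributes the explained variance between the factors without changing its total.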


Rotation of factor loadings is recommended particularly for loadings
obtained by maximum likelihood, since the initial values are constrained to satisfy
the uniqueness condition that $\hat{L}'\hat{\Psi}^{-1}\hat{L}$ be a diagonal matrix. This condition is
convenient for computational purposes, but may not lead to factors that can easily
be interpreted.


Example 9.10 (Rotated loadings for the stock-price data) Table 9.8 shows the initial
and rotated maximum likelihood estimates of the factor loadings for the stock-price
data of Examples 8.5 and 9.5. An $m = 2$ factor model is assumed. The estimated
specific variances and cumulative proportions of the total (standardized) sample vari-
ance explained by each factor are also given.

Table 9.8

                       Maximum likelihood
                       estimates of factor     Rotated estimated      Specific
                       loadings                factor loadings        variances
Variable               F1         F2           F1*        F2*         $\hat{\psi}_i = 1 - \hat{h}_i^2$

JP Morgan              .115       .755
Citibank               .322       .788
Wells Fargo            .182       .652
Royal Dutch Shell     1.000      -.000         .118       .993        .00
ExxonMobil             .683       .032         .113       .675        .53

Cumulative
proportion
of total
sample variance
explained

An interpretation of the factors suggested by the unrotated loadings was pre- 
sented in Example 9.5. We identified market and industry factors. 

The rotated loadings indicate that the bank stocks (JP Morgan, Citibank, and 
Wells Fargo) load highly on the first factor, while the oil stocks (Royal Dutch 
Shell and ExxonMobil) load highly on the second factor. (Although the rotated 
loadings obtained from the principal component solution are not displayed, the 
same phenomenon is observed for them.) The two rotated factors, together, 
differentiate the industries. It is difficult for us to label these factors intelligently. 
Factor 1 represents those unique economic forces that cause bank stocks to 
move together. Factor 2 appears to represent economic conditions affecting oil 
stocks. 

As we have noted, a general factor (that is, one on which all the variables load
highly) tends to be "destroyed after rotation." For this reason, in cases where a gen-
eral factor is evident, an orthogonal rotation is sometimes performed with the gen-
eral factor loadings fixed.⁵ ■


Example 9.11 (Rotated loadings for the Olympic decathlon data) The estimated 
factor loadings and specific variances for the Olympic decathlon data were 
presented in Example 9.6. These quantities were derived for an m = 4 factor 
model, using both principal component and maximum likelihood solution 
methods. The interpretation of all the underlying factors was not immediately 
evident. A varimax rotation [see (9-45)] was performed to see whether the rotated 
factor loadings would provide additional insights. The varimax rotated loadings 
for the m = 4 factor solutions are displayed in Table 9.9, along with the specific 
variances. Apart from the estimated loadings, rotation will affect only the distribu- 
tion of the proportions of the total sample variance explained by each factor. The 
cumulative proportion of the total sample variance explained for all factors does 
not change. 

The rotated factor loadings for both methods of solution point to the same 
underlying attributes, although factors 1 and 2 are not in the same order. We see 
that shot put, discus, and javelin load highly on a factor, and, following Linden 
[11], this factor might be called explosive arm strength. Similarly, high jump, 
110-meter hurdles, pole vault, and—to some extent—long jump load highly on 
another factor. Linden labeled this factor explosive leg strength. The 100-meter 
run, 400-meter run, and—again to some extent—the long jump load highly on a 
third factor. This factor could be called running speed. Finally, the 1500-meter run 
loads heavily and the 400-meter run loads heavily on the fourth factor. Linden 
called this factor running endurance. As he notes, “The basic functions indicated in 
this study are mainly consistent with the traditional classification of track and 
field athletics.” 


⁵ Some general-purpose factor analysis programs allow one to fix loadings associated with certain
factors and to rotate the remaining factors.


Table 9.9

                            Principal component                       Maximum likelihood
                     Estimated rotated            Specific     Estimated rotated            Specific
                     factor loadings              variances    factor loadings              variances
Variable              F1*    F2*    F3*    F4*    1 - h_i^2     F1*    F2*    F3*    F4*    1 - h_i^2

100-m run                                           .12                                       .01
Long jump            .291   .664   .435   .055      .29        .280   .554   .451             .39
Shot put                    .302   .252  -.097      .17        .883   .278   .228  -.045      .09
High jump            .267   .221   .683   .293      .33                      .057   .242      .33
400-m run            .086          .068   .507      .17        .142   .151          .700      .20
110-m hurdles        .048   .108         -.161      .28               .465   .173  -.033      .73
Discus                      .185   .204  -.076      .23               .220   .133  -.009      .30
Pole vault           .324   .278                    .30                                       .42
Javelin                     .024   .054   .188      .39                                       .73
1500-m run          -.002   .019   .075   .923      .15        .001   .110  -.070             .60

Cumulative
proportion
of total
sample
variance
explained            .22    .43    .62    .76


Plots of rotated maximum likelihood loadings for factor pairs (1, 2)
and (1, 3) are displayed in Figure 9.3. The points are generally
grouped along the factor axes. Plots of rotated principal component loadings are
very similar. ■


Oblique Rotations 


Orthogonal rotations are appropriate for a factor model in which the common fac-
tors are assumed to be independent. Many investigators in social sciences consider
oblique (nonorthogonal) rotations, as well as orthogonal rotations. The former are
often suggested after one views the estimated factor loadings and do not follow
from our postulated model. Nevertheless, an oblique rotation is frequently a useful
aid in factor analysis.


Figure 9.3 Rotated maximum likelihood loadings for factor pairs (1, 2) and (1, 3),
decathlon data. (The numbers in the figures correspond to variables.)

If we regard the $m$ common factors as coordinate axes, the point with the $m$
coordinates $(\hat{\ell}_{i1}, \hat{\ell}_{i2}, \ldots, \hat{\ell}_{im})$ represents the position of the $i$th variable in the factor
space. Assuming that the variables are grouped into nonoverlapping clusters, an or-
thogonal rotation to a simple structure corresponds to a rigid rotation of the coordi-
nate axes such that the axes, after rotation, pass as closely to the clusters as possible.
An oblique rotation to a simple structure corresponds to a nonrigid rotation of the
coordinate system such that the rotated axes (no longer perpendicular) pass (nearly)
through the clusters. An oblique rotation seeks to express each variable in terms
of a minimum number of factors, preferably a single factor. Oblique rotations are
discussed in several sources (see, for example, [6] or [10]) and will not be pursued in
this book.


9.5 Factor Scores 


In factor analysis, interest is usually centered on the parameters in the factor model. 
However, the estimated values of the common factors, called factor scores, may also 
be required. These quantities are often used for diagnostic purposes, as well as
inputs to a subsequent analysis.

Factor scores are not estimates of unknown parameters in the usual sense.
Rather, they are estimates of values for the unobserved random factor vectors $F_j$,
$j = 1, 2, \ldots, n$. That is, factor scores

$$\hat{f}_j = \text{estimate of the values } f_j \text{ attained by } F_j \text{ (}j\text{th case)}$$


The estimation situation is complicated by the fact that the unobserved quantities $f_j$
and $\varepsilon_j$ outnumber the observed $x_j$. To overcome this difficulty, some rather heuris-
tic, but reasoned, approaches to the problem of estimating factor values have been
advanced. We describe two of these approaches.

Both of the factor score approaches have two elements in common: 


1. They treat the estimated factor loadings $\hat{\ell}_{ij}$ and specific variances $\hat{\psi}_i$ as if they
were the true values.


2. They involve linear transformations of the original data, perhaps centered 
or standardized. Typically, the estimated rotated loadings, rather than the 
original estimated loadings, are used to compute factor scores. The com- 
putational formulas, as given in this section, do not change when rotated load- 
ings are substituted for unrotated loadings, so we will not differentiate 
between them. 


The Weighted Least Squares Method 


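Suppose first that the mean vector $\mu$, the factor loadings $L$, and the specific variance
$\Psi$ are known for the factor model

$$\underset{(p \times 1)}{X} - \underset{(p \times 1)}{\mu} = \underset{(p \times m)}{L}\;\underset{(m \times 1)}{F} + \underset{(p \times 1)}{\varepsilon}$$

Further, regard the specific factors $\varepsilon' = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p]$ as errors. Since
$\mathrm{Var}(\varepsilon_i) = \psi_i$, $i = 1, 2, \ldots, p$, need not be equal, Bartlett [2] has suggested that
weighted least squares be used to estimate the common factor values.
The sum of the squares of the errors, weighted by the reciprocal of their
variances, is

$$\sum_{i=1}^{p}\frac{\varepsilon_i^2}{\psi_i} = \varepsilon'\Psi^{-1}\varepsilon = (x - \mu - Lf)'\Psi^{-1}(x - \mu - Lf) \qquad \text{(9-47)}$$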


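Bartlett proposed choosing the estimates $\hat{f}$ of $f$ to minimize (9-47). The solution (see
Exercise 7.3) is

$$\hat{f} = (L'\Psi^{-1}L)^{-1}L'\Psi^{-1}(x - \mu) \qquad \text{(9-48)}$$

Motivated by (9-48), we take the estimates $\hat{L}$, $\hat{\Psi}$, and $\hat{\mu} = \bar{x}$ as the true values and
obtain the factor scores for the $j$th case as

$$\hat{f}_j = (\hat{L}'\hat{\Psi}^{-1}\hat{L})^{-1}\hat{L}'\hat{\Psi}^{-1}(x_j - \bar{x}) \qquad \text{(9-49)}$$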


When $\hat{L}$ and $\hat{\Psi}$ are determined by the maximum likelihood method, these estimates
must satisfy the uniqueness condition, $\hat{L}'\hat{\Psi}^{-1}\hat{L} = \hat{\Delta}$, a diagonal matrix. We then
have the following:


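Factor Scores Obtained by Weighted Least Squares
from the Maximum Likelihood Estimates

$$\begin{aligned}
\hat{f}_j &= (\hat{L}'\hat{\Psi}^{-1}\hat{L})^{-1}\hat{L}'\hat{\Psi}^{-1}(x_j - \hat{\mu}) \\
&= \hat{\Delta}^{-1}\hat{L}'\hat{\Psi}^{-1}(x_j - \bar{x}), \qquad j = 1, 2, \ldots, n
\end{aligned}$$

or, if the correlation matrix is factored                                   (9-50)

$$\begin{aligned}
\hat{f}_j &= (\hat{L}_z'\hat{\Psi}_z^{-1}\hat{L}_z)^{-1}\hat{L}_z'\hat{\Psi}_z^{-1}z_j \\
&= \hat{\Delta}_z^{-1}\hat{L}_z'\hat{\Psi}_z^{-1}z_j, \qquad j = 1, 2, \ldots, n
\end{aligned}$$

where $z_j = D^{-1/2}(x_j - \bar{x})$, as in (8-25), and $\hat{\rho} = \hat{L}_z\hat{L}_z' + \hat{\Psi}_z$.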


The factor scores generated by (9-50) have sample mean vector 0 and zero sample
covariances. (See Exercise 9.16.)

If rotated loadings $\hat{L}^* = \hat{L}T$ are used in place of the original loadings in (9-50),
the subsequent factor scores, $\hat{f}_j^*$, are related to $\hat{f}_j$ by $\hat{f}_j^* = T'\hat{f}_j$, $j = 1, 2, \ldots, n$.


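Comment. If the factor loadings are estimated by the principal component
method, it is customary to generate factor scores using an unweighted (ordinary)
least squares procedure. Implicitly, this amounts to assuming that the $\psi_i$ are equal or
nearly equal. The factor scores are then

$$\hat{f}_j = (\tilde{L}'\tilde{L})^{-1}\tilde{L}'(x_j - \bar{x})$$

or

$$\hat{f}_j = (\tilde{L}_z'\tilde{L}_z)^{-1}\tilde{L}_z'z_j$$

for standardized data. Since $\tilde{L} = [\sqrt{\hat{\lambda}_1}\,\hat{e}_1 \;|\; \sqrt{\hat{\lambda}_2}\,\hat{e}_2 \;|\; \cdots \;|\; \sqrt{\hat{\lambda}_m}\,\hat{e}_m]$ [see (9-15)],
we have

$$\hat{f}_j = \begin{bmatrix}
\dfrac{1}{\sqrt{\hat{\lambda}_1}}\,\hat{e}_1'(x_j - \bar{x}) \\[2ex]
\dfrac{1}{\sqrt{\hat{\lambda}_2}}\,\hat{e}_2'(x_j - \bar{x}) \\
\vdots \\
\dfrac{1}{\sqrt{\hat{\lambda}_m}}\,\hat{e}_m'(x_j - \bar{x})
\end{bmatrix} \qquad \text{(9-51)}$$

For these factor scores,

$$\frac{1}{n}\sum_{j=1}^{n}\hat{f}_j = 0 \quad \text{(sample mean)}$$

and

$$\frac{1}{n - 1}\sum_{j=1}^{n}\hat{f}_j\hat{f}_j' = I \quad \text{(sample covariance)}$$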

Comparing (9-51) with (8-21), we see that the $\hat{f}_j$ are nothing more than the first
$m$ (scaled) principal components, evaluated at $x_j$.
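The equivalence between the unweighted least squares scores and the scaled principal components in (9-51) can be seen in the smallest possible case. The sketch below uses $p = 2$ standardized variables with a hypothetical correlation $r = 0.6$ and a hypothetical observation $z$; neither value comes from the text. For this correlation matrix the largest eigenvalue and its eigenvector are available in closed form, so no numerical eigen-solver is needed.

```python
import math


def pc_factor_score(z, r):
    """One-factor (m = 1) principal component score for p = 2 standardized variables.

    For R = [[1, r], [r, 1]] with r > 0, the largest eigenvalue is 1 + r and
    its unit eigenvector is (1/sqrt(2), 1/sqrt(2)).
    """
    lam = 1.0 + r
    e = (1 / math.sqrt(2), 1 / math.sqrt(2))
    loading = [math.sqrt(lam) * ei for ei in e]  # L = sqrt(lambda) * e
    # Unweighted least squares: f = (L'L)^{-1} L'z
    ltl = sum(l * l for l in loading)
    f_ls = sum(l * zi for l, zi in zip(loading, z)) / ltl
    # Scaled principal component, as in (9-51): f = e'z / sqrt(lambda)
    f_pc = sum(ei * zi for ei, zi in zip(e, z)) / math.sqrt(lam)
    return f_ls, f_pc


# Hypothetical correlation and standardized observation, for illustration only.
f_ls, f_pc = pc_factor_score(z=(1.2, 0.4), r=0.6)
print(f_ls, f_pc)
```

Both routes give the same number because $\tilde{L}'\tilde{L} = \hat{\lambda}_1$ when $\tilde{L} = \sqrt{\hat{\lambda}_1}\,\hat{e}_1$ and $\hat{e}_1$ has unit length.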


The Regression Method 


Starting again with the original factor model X — wp = LF + e, we initially treat 
the loadings matrix L and specific variance matrix ¥ as known. When the common 
factors F and the specific factors (or errors) e are jointly normally distributed with 
means and covariances given by (9-3), the linear combination X — wp = LF + e has 
an N,(0,LL' + ¥) distribution..(See Result 4.3.) Moreover, the joint distribution 
of (X — mw) and Fis N,,+,(0, *), where 


S=WW+vi L 


Pe RAT (exp) __§ (pxm) ae 
(m+ p)X(m+p) L ) 
(mxp) —} (mxm) 


and 0 is an (m + p) X J vector of zeros. Using Result 4.6, we find that the condi- 
tional distribution of F|x is multivariate normal with 


$$\text{mean} = E(F \mid x) = L'\Sigma^{-1}(x - \mu) = L'(LL' + \Psi)^{-1}(x - \mu) \tag{9-53}$$

and

$$\text{covariance} = \mathrm{Cov}(F \mid x) = I - L'\Sigma^{-1}L = I - L'(LL' + \Psi)^{-1}L \tag{9-54}$$


The quantities $L'(LL' + \Psi)^{-1}$ in (9-53) are the coefficients in a (multivariate)
regression of the factors on the variables. Estimates of these coefficients produce
factor scores that are analogous to the estimates of the conditional mean values in
multivariate regression analysis. (See Chapter 7.) Consequently, given any vector of
observations $x_j$, and taking the maximum likelihood estimates $\hat{L}$ and $\hat{\Psi}$ as the true
values, we see that the jth factor score vector is given by

$$\hat{f}_j = \hat{L}'\hat{\Sigma}^{-1}(x_j - \bar{x}) = \hat{L}'(\hat{L}\hat{L}' + \hat{\Psi})^{-1}(x_j - \bar{x}), \quad j = 1, 2, \ldots, n \tag{9-55}$$

The calculation of $\hat{f}_j$ in (9-55) can be simplified by using the matrix identity (see
Exercise 9.6)

$$\underset{(m\times p)}{\hat{L}'}\,\underset{(p\times p)}{(\hat{L}\hat{L}' + \hat{\Psi})^{-1}} = \underset{(m\times m)}{(I + \hat{L}'\hat{\Psi}^{-1}\hat{L})^{-1}}\,\underset{(m\times p)}{\hat{L}'}\,\underset{(p\times p)}{\hat{\Psi}^{-1}} \tag{9-56}$$

This identity allows us to compare the factor scores in (9-55), generated by the
regression argument, with those generated by the weighted least squares procedure
[see (9-50)]. Temporarily, we denote the former by $\hat{f}_j^{R}$ and the latter by $\hat{f}_j^{LS}$. Then,
using (9-56), we obtain

$$\hat{f}_j^{LS} = (\hat{L}'\hat{\Psi}^{-1}\hat{L})^{-1}(I + \hat{L}'\hat{\Psi}^{-1}\hat{L})\,\hat{f}_j^{R} = \left(I + (\hat{L}'\hat{\Psi}^{-1}\hat{L})^{-1}\right)\hat{f}_j^{R} \tag{9-57}$$

For maximum likelihood estimates, $(\hat{L}'\hat{\Psi}^{-1}\hat{L})^{-1} = \hat{\Delta}^{-1}$, and if the elements of this
diagonal matrix are close to zero, the regression and generalized least squares
methods will give nearly the same factor scores.
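The identity (9-56) and the relation (9-57) hold for any loadings and positive specific variances, so they can be confirmed on arbitrary numbers. Here is a minimal sketch, assuming NumPy; the loadings, specific variances, and observation below are randomly generated stand-ins, not estimates from any data set in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 6, 2
L = rng.normal(size=(p, m))                   # hypothetical loadings
psi = rng.uniform(0.2, 1.0, size=p)           # hypothetical specific variances
Psi_inv = np.diag(1.0 / psi)
Sigma = L @ L.T + np.diag(psi)
d = rng.normal(size=p)                        # stands in for x_j - xbar

# (9-56): L'(LL' + Psi)^{-1} = (I + L'Psi^{-1}L)^{-1} L'Psi^{-1}
Delta = L.T @ Psi_inv @ L                     # diagonal only under the ML uniqueness condition
lhs = L.T @ np.linalg.inv(Sigma)
rhs = np.linalg.solve(np.eye(m) + Delta, L.T @ Psi_inv)

f_reg = lhs @ d                                      # regression scores, as in (9-55)
f_ls = np.linalg.solve(Delta, L.T @ Psi_inv @ d)     # weighted least squares scores, as in (9-50)
f_ls_from_reg = (np.eye(m) + np.linalg.inv(Delta)) @ f_reg   # right side of (9-57)
```

When the entries of `np.linalg.inv(Delta)` are small, `f_ls` and `f_reg` nearly coincide, which is the situation described above.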




In an attempt to reduce the effects of a (possibly) incorrect determination of
the number of factors, practitioners tend to calculate the factor scores in (9-55) by
using $S$ (the original sample covariance matrix) instead of $\hat{\Sigma} = \hat{L}\hat{L}' + \hat{\Psi}$. We then
have the following:


Factor Scores Obtained by Regression

$$\hat{f}_j = \hat{L}'S^{-1}(x_j - \bar{x}), \quad j = 1, 2, \ldots, n$$

or, if a correlation matrix is factored,

$$\hat{f}_j = \hat{L}_z'R^{-1}z_j, \quad j = 1, 2, \ldots, n \tag{9-58}$$

where, see (8-25),

$$z_j = D^{-1/2}(x_j - \bar{x}) \quad \text{and} \quad \hat{\rho} = \hat{L}_z\hat{L}_z' + \hat{\Psi}_z$$


Again, if rotated loadings $\hat{L}^* = \hat{L}T$ are used in place of the original loadings in
(9-58), the subsequent factor scores $\hat{f}_j^*$ are related to $\hat{f}_j$ by

$$\hat{f}_j^* = T'\hat{f}_j, \quad j = 1, 2, \ldots, n$$


A numerical measure of agreement between the factor scores generated from 
two different calculation methods is provided by the sample correlation coefficient 
between scores on the same factor. Of the methods presented, none is recommended 
as uniformly superior. 


Example 9.12 (Computing factor scores) We shall illustrate the computation of
factor scores by the least squares and regression methods using the stock-price data
discussed in Example 9.10. A maximum likelihood solution from R gave the
estimated rotated loadings and specific variances

$$\hat{L}_z^* = \begin{bmatrix} .763 & .024 \\ .821 & .227 \\ .669 & .104 \\ .118 & .993 \\ .113 & .675 \end{bmatrix} \quad \text{and} \quad \hat{\Psi}_z = \begin{bmatrix} .42 & 0 & 0 & 0 & 0 \\ 0 & .27 & 0 & 0 & 0 \\ 0 & 0 & .54 & 0 & 0 \\ 0 & 0 & 0 & .00 & 0 \\ 0 & 0 & 0 & 0 & .53 \end{bmatrix}$$

The vector of standardized observations,

$$z' = [.50, -1.40, -.20, -.70, 1.40]$$


yields the following scores on factors 1 and 2: 




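Weighted least squares (9-50):⁶

$$\hat{f} = (\hat{L}_z^{*\prime}\hat{\Psi}_z^{-1}\hat{L}_z^*)^{-1}\hat{L}_z^{*\prime}\hat{\Psi}_z^{-1}z = \begin{bmatrix} -.61 \\ -.61 \end{bmatrix}$$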


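Regression (9-58):

$$\hat{f} = \hat{L}_z^{*\prime}R^{-1}z = \begin{bmatrix} .331 & .526 & .221 & -.137 & .011 \\ -.040 & -.063 & -.026 & 1.023 & -.001 \end{bmatrix} \begin{bmatrix} .50 \\ -1.40 \\ -.20 \\ -.70 \\ 1.40 \end{bmatrix} = \begin{bmatrix} -.50 \\ -.64 \end{bmatrix}$$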


In this case, the two methods produce very similar results. All of the regression
factor scores, obtained using (9-58), are plotted in Figure 9.4. ■
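The weighted least squares calculation in Example 9.12 can be reproduced in a few lines. This is a sketch, assuming NumPy; the variable names are hypothetical, and the fourth specific variance is entered as .01 rather than .00 so that the diagonal matrix can be inverted, as the example's footnote describes.

```python
import numpy as np

# Rotated ML loadings and specific variances from Example 9.12.
L_z = np.array([[0.763, 0.024],
                [0.821, 0.227],
                [0.669, 0.104],
                [0.118, 0.993],
                [0.113, 0.675]])
psi_z = np.array([0.42, 0.27, 0.54, 0.01, 0.53])   # .00 replaced by .01 to allow inversion
z = np.array([0.50, -1.40, -0.20, -0.70, 1.40])    # standardized observation

W = L_z.T / psi_z                          # L_z' Psi_z^{-1}, shape (2, 5)
f_wls = np.linalg.solve(W @ L_z, W @ z)    # (L_z' Psi_z^{-1} L_z)^{-1} L_z' Psi_z^{-1} z
print(np.round(f_wls, 2))                  # [-0.61 -0.61]
```

The regression scores in (9-58) additionally require the sample correlation matrix R of the stock-price data, which is not reproduced in this example, so they are not recomputed here.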


Comment. Factor scores with a rather pleasing intuitive property can be
constructed very simply. Group the variables with high (say, greater than .40 in
absolute value) loadings on a factor. The scores for factor 1 are then formed by
summing the (standardized) observed values of the variables in the group,
combined according to the sign of the loadings. The factor scores for factor 2 are
the sums of the standardized observations corresponding to variables with high
loadings on factor 2, and so forth. Data reduction is accomplished by replacing the
standardized data by these simple factor scores. The simple factor scores are
frequently highly correlated with the factor scores obtained by the more complex
least squares and regression methods.


Figure 9.4 Factor scores using (9-58) for factors 1 and 2 of the stock-price data
(maximum likelihood estimates of the factor loadings). Axes: Factor 1 and Factor 2.


⁶ In order to calculate the weighted least squares factor scores, .00 in the fourth diagonal
position of $\hat{\Psi}_z$ was set to .01 so that this matrix could be inverted.


Example 9.13 (Creating simple summary scores from factor analysis groupings) The
principal component factor analysis of the stock price data in Example 9.4 produced
the estimated loadings

$$\tilde{L} = \begin{bmatrix} .732 & -.437 \\ .831 & -.280 \\ .726 & -.374 \\ .605 & .694 \\ .563 & .719 \end{bmatrix} \quad \text{and} \quad \tilde{L}^* = \tilde{L}T = \begin{bmatrix} .852 & .030 \\ .851 & .214 \\ .813 & .079 \\ .133 & .911 \\ .084 & .909 \end{bmatrix}$$

For each factor, take the loadings with largest absolute value in $\tilde{L}$ as equal in
magnitude, and neglect the smaller loadings. Thus, we create the linear combinations

$$\hat{f}_1 = x_1 + x_2 + x_3 + x_4 + x_5$$
$$\hat{f}_2 = x_4 + x_5 - x_1$$

as a summary. In practice, we would standardize these new variables.

If, instead of $\tilde{L}$, we start with the varimax rotated loadings $\tilde{L}^*$, the simple factor
scores would be

$$\hat{f}_1 = x_1 + x_2 + x_3$$
$$\hat{f}_2 = x_4 + x_5$$

The identification of high loadings and negligible loadings is really quite subjective.
Linear compounds that make subject-matter sense are preferable. ■
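The grouping in Example 9.13 can be mimicked with a simple thresholding rule. The sketch below assumes NumPy; the helper `simple_score_weights` and its cutoff of 0.40 are assumptions made for illustration, since the choice of which loadings count as "high" is subjective.

```python
import numpy as np

def simple_score_weights(loadings, cutoff=0.40):
    """Return a (p, m) matrix of -1/0/+1 weights: the sign of every loading whose
    absolute value exceeds the cutoff, and 0 for the loadings that are neglected."""
    loadings = np.asarray(loadings, dtype=float)
    return np.sign(loadings) * (np.abs(loadings) > cutoff)

# Principal component loadings and varimax rotated loadings from Example 9.13.
L_pc = np.array([[0.732, -0.437],
                 [0.831, -0.280],
                 [0.726, -0.374],
                 [0.605,  0.694],
                 [0.563,  0.719]])
L_rot = np.array([[0.852, 0.030],
                  [0.851, 0.214],
                  [0.813, 0.079],
                  [0.133, 0.911],
                  [0.084, 0.909]])

W_pc = simple_score_weights(L_pc)     # columns: x1+x2+x3+x4+x5 and x4+x5-x1
W_rot = simple_score_weights(L_rot)   # columns: x1+x2+x3 and x4+x5

# Simple scores for one vector of standardized observations (values are illustrative).
z = np.array([0.50, -1.40, -0.20, -0.70, 1.40])
scores_rot = z @ W_rot
```

With this cutoff the rule reproduces both sets of linear combinations given in the example; a different cutoff would change the groupings.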


Although multivariate normality is often assumed for the variables in a factor 
analysis, it is very difficult to justify the assumption for a large number of variables. 
As we pointed out in Chapter 4, marginal transformations may help. Similarly, the
factor scores may or may not be normally distributed. Bivariate scatter plots of fac- 
tor scores can produce all sorts of nonelliptical shapes. Plots of factor scores should 
be examined prior to using these scores in other analyses. They can reveal outlying 
values and the extent of the (possible) nonnormality. 


9.6 Perspectives and a Strategy for Factor Analysis 


There are many decisions that must be made in any factor analytic study. Probably
the most important decision is the choice of m, the number of common factors.
Although a large sample test of the adequacy of a model is available for a given m, it
is suitable only for data that are approximately normally distributed. Moreover, the
test will most assuredly reject the model for small m if the number of variables and
observations is large. Yet this is the situation when factor analysis provides a useful
approximation. Most often, the final choice of m is based on some combination of
(1) the proportion of the sample variance explained, (2) subject-matter knowledge,
and (3) the "reasonableness" of the results.

The choice of the solution method and type of rotation is a less crucial decision.
In fact, the most satisfactory factor analyses are those in which rotations are
tried with more than one method and all the results substantially confirm the same
factor structure.

At the present time, factor analysis still maintains the flavor of an art, and no 
single strategy should yet be “chiseled into stone.” We suggest and illustrate one 
reasonable option: 

1. Perform a principal component factor analysis. This method is particularly
appropriate for a first pass through the data. (It is not required that R or S be
nonsingular.)
(a) Look for suspicious observations by plotting the factor scores. Also, 

calculate standardized scores for each observation and squared distances as 
described in Section 4.6. 

(b) Try a varimax rotation. 

2. Perform a maximum likelihood factor analysis, including a varimax rotation. 


3. Compare the solutions obtained from the two factor analyses. 
(a) Do the loadings group in the same manner? 


(b) Plot factor scores obtained for principal components against scores from 
the maximum likelihood analysis. 


4. Repeat the first three steps for other numbers of common factors m. Do extra fac- 
tors necessarily contribute to the understanding and interpretation of the data? 


5. For large data sets, split them in half and perform a factor analysis on each part. 
Compare the two results with each other and with that obtained from the com- 
plete data set to check the stability of the solution. (The data might be divided 
by placing the first half of the cases in one group and the second half of the 
cases in the other group. This would reveal changes over time.) 
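Step 5 of the strategy above, the split-half stability check, can be sketched as follows. The code assumes NumPy and uses synthetic one-factor data; the helper `pc_loadings`, the sign convention, and the simulated loadings are all hypothetical choices made for illustration.

```python
import numpy as np

def pc_loadings(X, m):
    """Principal component loadings computed from the sample correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)          # eigenvalues in ascending order
    lam = eigvals[::-1][:m]
    E = eigvecs[:, ::-1][:, :m]
    L = E * np.sqrt(lam)
    # Eigenvector signs are arbitrary; flip each column so that its entries sum to >= 0.
    signs = np.where(L.sum(axis=0) < 0, -1.0, 1.0)
    return L * signs

# Synthetic one-factor data, for illustration only.
rng = np.random.default_rng(2)
n = 2000
true_loadings = np.array([0.9, 0.8, 0.7, 0.6, 0.8, 0.9])
F = rng.normal(size=(n, 1))
noise = rng.normal(size=(n, 6)) * np.sqrt(1.0 - true_loadings**2)
X = F * true_loadings + noise

# First half of the cases in one group, second half in the other.
L_first = pc_loadings(X[: n // 2], m=1)
L_second = pc_loadings(X[n // 2 :], m=1)
L_full = pc_loadings(X, m=1)
max_gap = np.max(np.abs(L_first - L_second))      # small values suggest a stable solution
```

In practice the comparison would also be made after rotation and for each retained factor, keeping in mind that factor order and sign may differ between the two halves.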


Example 9.14 (Factor analysis of chicken-bone data) We present the results of
several factor analyses on bone and skull measurements of white leghorn fowl. The
original data were taken from Dunn [5]. Factor analysis of Dunn's data was
originally considered by Wright [15], who started his analysis from a different
correlation matrix than the one we use.

The full data set consists of n = 276 measurements on bone dimensions:

Head:  $X_1$ = skull length,    $X_2$ = skull breadth
Leg:   $X_3$ = femur length,    $X_4$ = tibia length
Wing:  $X_5$ = humerus length,  $X_6$ = ulna length


The sample correlation matrix

$$R = \begin{bmatrix} 1.000 & .505 & .569 & .602 & .621 & .603 \\ .505 & 1.000 & .422 & .467 & .482 & .450 \\ .569 & .422 & 1.000 & .926 & .877 & .878 \\ .602 & .467 & .926 & 1.000 & .874 & .894 \\ .621 & .482 & .877 & .874 & 1.000 & .937 \\ .603 & .450 & .878 & .894 & .937 & 1.000 \end{bmatrix}$$

was factor analyzed by the principal component and maximum likelihood methods
for an m = 3 factor model. The results are given in Table 9.10.⁷


Table 9.10

Principal Component

                         Estimated factor loadings    Rotated estimated loadings
Variable                  F1       F2       F3         F1*      F2*      F3*
1. Skull length            …        …        …          …        …        …
2. Skull breadth         .604     .720    -.340         …        …        …
3. Femur length          .929    -.233    -.075         …        …        …
4. Tibia length          .943    -.175    -.067         …        …        …
5. Humerus length          …        …        …          …        …        …
6. Ulna length             …        …        …          …        …        …

Cumulative proportion
of total (standardized)
sample variance
explained                .743     .873     .950       .576     .763     .950


Maximum Likelihood

                         Estimated factor loadings    Rotated estimated loadings
Variable                  F1       F2       F3         F1*      F2*      F3*     $\hat{\psi}_i$
1. Skull length          .602     .214     .286       .467       …      .128     .51
2. Skull breadth         .467     .177     .652         …      .792     .050     .33
3. Femur length          .926     .145    -.057         …      .289     .084     .12
4. Tibia length         1.000     .000    -.000         …      .345    -.073     .00
5. Humerus length        .874     .463    -.012         …      .362     .396     .02
6. Ulna length           .894     .336       …          …      .325     .272       …

Cumulative proportion
of total (standardized)
sample variance
explained                .667       …        …          …        …        …


⁷ Notice the estimated specific variance of .00 for tibia length in the maximum likelihood solution.
This suggests that maximizing the likelihood function may produce a Heywood case. Readers attempting
to replicate our results should try the Hey(wood) option if SAS or similar software is used.




After rotation, the two methods of solution appear to give somewhat different
results. Focusing our attention on the principal component method and the
cumulative proportion of the total sample variance explained, we see that a three-factor
solution appears to be warranted. The third factor explains a "significant" amount of
additional sample variation. The first factor appears to be a body-size factor
dominated by wing and leg dimensions. The second and third factors, collectively,
represent skull dimensions and might be given the same names as the variables, skull
breadth and skull length, respectively.

The rotated maximum likelihood factor loadings are consistent with those
generated by the principal component method for the first factor, but not for factors 2
and 3. For the maximum likelihood method, the second factor appears to represent
head size. The meaning of the third factor is unclear, and it is probably not needed.

Further support for retaining three or fewer factors is provided by the residual
matrix obtained from the maximum likelihood estimates:

$$R - \hat{L}_z\hat{L}_z' - \hat{\Psi}_z = \begin{bmatrix} .000 & & & & & \\ -.000 & .000 & & & & \\ -.003 & .001 & .000 & & & \\ .000 & .000 & .000 & .000 & & \\ -.001 & .000 & .000 & .000 & .000 & \\ .004 & -.001 & -.001 & .000 & -.000 & .000 \end{bmatrix}$$

All of the entries in this matrix are very small. We shall pursue the m = 3 factor
model in this example. An m = 2 factor model is considered in Exercise 9.10.

Factor scores for factors 1 and 2 produced from (9-58) with the rotated maxi- 
mum likelihood estimates are plotted in Figure 9.5. Plots of this kind allow us to 
identify observations that, for one reason or another, are not consistent with the 
remaining observations. Potential outliers are circled in the figure. 

It is also of interest to plot pairs of factor scores obtained using the principal
component and maximum likelihood estimates of factor loadings. For the
chicken-bone data, plots of pairs of factor scores are given in Figure 9.6 on pages 524-526. If
the loadings on a particular factor agree, the pairs of scores should cluster tightly
about the 45° line through the origin. Sets of loadings that do not agree will produce
factor scores that deviate from this pattern. If the latter occurs, it is usually
associated with the last factors and may suggest that the number of factors is too large. That
is, the last factors are not meaningful. This seems to be the case with the third factor
in the chicken-bone data, as indicated by Plot (c) in Figure 9.6.

Plots of pairs of factor scores using estimated loadings from two solution
methods are also good tools for detecting outliers. If the sets of loadings for a factor
tend to agree, outliers will appear as points in the neighborhood of the 45° line, but
far from the origin and the cluster of the remaining points. It is clear from Plot (b) in
Figure 9.6 that one of the 276 observations is not consistent with the others. It has an
unusually large $F_2$-score. When this point, [39.1, 39.3, 75.7, 115, 73.4, 69.1], was
removed and the analysis repeated, the loadings were not altered appreciably.

When the data set is large, it should be divided into two (roughly) equal sets,
and a factor analysis should be performed on each half. The results of these analyses
can be compared with each other and with the analysis for the full data set to
test the stability of the solution. If the results are consistent with one another,
confidence in the solution is increased.


Figure 9.5 Factor scores for the first two factors of chicken-bone data.

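The chicken-bone data were divided into two sets of $n_1 = 137$ and $n_2 = 139$
observations, respectively. The resulting sample correlation matrices were

$$R_1 = \begin{bmatrix} 1.000 & & & & & \\ .696 & 1.000 & & & & \\ \ldots & \ldots & 1.000 & & & \\ .639 & .575 & .901 & 1.000 & & \\ .694 & .606 & .844 & .835 & 1.000 & \\ .660 & .584 & .866 & .863 & .931 & 1.000 \end{bmatrix}$$

and

$$R_2 = \begin{bmatrix} 1.000 & & & & & \\ .366 & 1.000 & & & & \\ .572 & .352 & 1.000 & & & \\ .587 & .406 & .950 & 1.000 & & \\ .587 & .420 & .909 & .911 & 1.000 & \\ .598 & .386 & .894 & .927 & .940 & 1.000 \end{bmatrix}$$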


Figure 9.6 Pairs of factor scores for the chicken-bone data. (Loadings are
estimated by principal component and maximum likelihood methods.)
Panel (a): first factor. Axes: principal component and maximum likelihood factor scores.


The rotated estimated loadings, specific variances, and proportion of the total 
(standardized) sample variance explained for a principal component solution of an 
m = 3 factor model are given in Table 9.11 on page 525. 

The results for the two halves of the chicken-bone measurements are very
similar. Factors $F_2^*$ and $F_3^*$ interchange with respect to their labels, skull length and skull
breadth, but they collectively seem to represent head size. The first factor, $F_1^*$, again
appears to be a body-size factor dominated by leg and wing dimensions. These are
the same interpretations we gave to the results from a principal component factor
analysis of the entire set of data. The solution is remarkably stable, and we can be
fairly confident that the large loadings are "real." As we have pointed out, however,
three factors are probably too many. A one- or two-factor model is surely sufficient
for the chicken-bone data, and you are encouraged to repeat the analyses here with
fewer factors and alternative solution methods. (See Exercise 9.10.) ■


Figure 9.6 (continued) Panel (b): second factor. Axes: principal component and
maximum likelihood factor scores.


Table 9.11 Rotated estimated factor loadings for the first set ($n_1 = 137$ observations)
and the second set ($n_2 = 139$ observations). Rows: 1. Skull length, 2. Skull breadth,
3. Femur length, 4. Tibia length, 5. Humerus length, 6. Ulna length, and the cumulative
proportion of total (standardized) sample variance explained.


Figure 9.6 (continued) Panel (c): third factor. Axes: principal component and
maximum likelihood factor scores.


Factor analysis has a tremendous intuitive appeal for the behavioral and social 
sciences. In these areas, it is natural to regard multivariate observations on animal 
and human processes as manifestations of underlying unobservable “traits.” Factor 
analysis provides a way of explaining the observed variability in behavior in terms 
of these traits. 

Still, when all is said and done, factor analysis remains very subjective. Our
examples, in common with most published sources, consist of situations in which the factor
analysis model provides reasonable explanations in terms of a few interpretable
factors. In practice, the vast majority of attempted factor analyses do not yield such
clear-cut results. Unfortunately, the criterion for judging the quality of any factor analysis
has not been well quantified. Rather, that quality seems to depend on a


WOW criterion 


If, while scrutinizing the factor analysis, the investigator can shout “Wow, I under- 
stand these factors,” the application is deemed successful. 


Supplement 


SOME COMPUTATIONAL DETAILS 
FOR MAXIMUM LIKELIHOOD 
ESTIMATION 


Although a simple analytical expression cannot be obtained for the maximum
likelihood estimators $\hat{L}$ and $\hat{\Psi}$, they can be shown to satisfy certain equations. Not
surprisingly, the conditions are stated in terms of the maximum likelihood estimator
$S_n = (1/n)\sum_{j=1}^{n}(X_j - \bar{X})(X_j - \bar{X})'$ of an unstructured covariance matrix. Some
factor analysts employ the usual sample covariance $S$, but still use the title maximum
likelihood to refer to resulting estimates. This modification, referenced in Footnote 4
of this chapter, amounts to employing the likelihood obtained from the Wishart
distribution of $\sum_{j=1}^{n}(X_j - \bar{X})(X_j - \bar{X})'$ and ignoring the minor contribution due to
the normal density for $\bar{X}$. The factor analysis of $R$ is, of course, unaffected by the
choice of $S_n$ or $S$, since they both produce the same correlation matrix.


Result 9A.1. Let $x_1, x_2, \ldots, x_n$ be a random sample from a normal population.
The maximum likelihood estimates $\hat{L}$ and $\hat{\Psi}$ are obtained by maximizing (9-25)
subject to the uniqueness condition in (9-26). They satisfy

$$(\hat{\Psi}^{-1/2}S_n\hat{\Psi}^{-1/2})(\hat{\Psi}^{-1/2}\hat{L}) = (\hat{\Psi}^{-1/2}\hat{L})(I + \hat{\Delta}) \tag{9A-1}$$

so the jth column of $\hat{\Psi}^{-1/2}\hat{L}$ is the (nonnormalized) eigenvector of $\hat{\Psi}^{-1/2}S_n\hat{\Psi}^{-1/2}$
corresponding to eigenvalue $1 + \hat{\Delta}_j$. Here

$$S_n = n^{-1}\sum_{j=1}^{n}(x_j - \bar{x})(x_j - \bar{x})' = n^{-1}(n - 1)S \quad \text{and} \quad \hat{\Delta}_1 \ge \hat{\Delta}_2 \ge \cdots \ge \hat{\Delta}_m$$

Also, at convergence,

$$\hat{\psi}_i = i\text{th diagonal element of } S_n - \hat{L}\hat{L}' \tag{9A-2}$$

and

$$\mathrm{tr}(\hat{\Sigma}^{-1}S_n) = p$$


We avoid the details of the proof. However, it is evident that $\hat{\mu} = \bar{x}$ and a consideration
of the log-likelihood leads to the maximization of $-(n/2)[\ln|\Sigma| + \mathrm{tr}(\Sigma^{-1}S_n)]$ over $L$
and $\Psi$. Equivalently, since $S_n$ and $p$ are constant with respect to the maximization, we
minimize

$$h(\mu, \Psi, L) = \ln|\Sigma| - \ln|S_n| + \mathrm{tr}(\Sigma^{-1}S_n) - p \tag{9A-3}$$

subject to $L'\Psi^{-1}L = \Delta$, a diagonal matrix. ■
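The objective function in (9A-3) is straightforward to evaluate for any candidate $L$ and $\Psi$. The sketch below assumes NumPy; the function name `h_objective` and the numerical values are hypothetical, and $\mu$ does not appear as an argument because $\hat{\mu} = \bar{x}$ has already been substituted.

```python
import numpy as np

def h_objective(L, psi, S_n):
    """Evaluate ln|Sigma| - ln|S_n| + tr(Sigma^{-1} S_n) - p with Sigma = LL' + diag(psi)."""
    L = np.asarray(L, dtype=float)
    psi = np.asarray(psi, dtype=float)
    Sigma = L @ L.T + np.diag(psi)
    p = S_n.shape[0]
    _, logdet_Sigma = np.linalg.slogdet(Sigma)    # log-determinants avoid overflow
    _, logdet_S = np.linalg.slogdet(S_n)
    return logdet_Sigma - logdet_S + np.trace(np.linalg.solve(Sigma, S_n)) - p

# Hypothetical one-factor structure; S_n is built to satisfy the model exactly.
L_true = np.array([[0.8], [0.6], [0.7], [0.5]])
psi_true = 1.0 - L_true[:, 0] ** 2
S_n = L_true @ L_true.T + np.diag(psi_true)

h_at_truth = h_objective(L_true, psi_true, S_n)       # zero: Sigma reproduces S_n
h_off = h_objective(0.9 * L_true, psi_true, S_n)      # positive: Sigma differs from S_n
```

The function is nonnegative and equals zero only when $LL' + \Psi$ reproduces $S_n$ exactly, which is why smaller values indicate a better fit.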


Comment. Lawley and Maxwell [10], along with many others who do factor
analysis, use the unbiased estimate $S$ of the covariance matrix instead of the
maximum likelihood estimate $S_n$. Now, $(n - 1)S$ has, for normal data, a Wishart
distribution. [See (4-21) and (4-23).] If we ignore the contribution to the likelihood in
(9-25) from the second term involving $(\mu - \bar{x})$, then maximizing the reduced
likelihood over $L$ and $\Psi$ is equivalent to maximizing the Wishart likelihood

$$\text{Likelihood} \propto |\Sigma|^{-(n-1)/2}\, e^{-[(n-1)/2]\,\mathrm{tr}[\Sigma^{-1}S]}$$

over $L$ and $\Psi$. Equivalently, we can minimize

$$\ln|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)$$

or, as in (9A-3),

$$\ln|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) - \ln|S| - p$$

Under these conditions, Result (9A-1) holds with $S$ in place of $S_n$. Also, for large n,
$S$ and $S_n$ are almost identical, and the corresponding maximum likelihood estimates,
$\hat{L}$ and $\hat{\Psi}$, would be similar. For testing the factor model [see (9-39)], $|\hat{L}\hat{L}' + \hat{\Psi}|$
should be compared with $|S_n|$ if the actual likelihood of (9-25) is employed, and
$|\hat{L}\hat{L}' + \hat{\Psi}|$ should be compared with $|S|$ if the foregoing Wishart likelihood is used
to derive $\hat{L}$ and $\hat{\Psi}$.


Recommended Computational Scheme 


For m > 1, the condition $L'\Psi^{-1}L = \Delta$ effectively imposes m(m - 1)/2 constraints
on the elements of $L$ and $\Psi$, and the likelihood equations are solved, subject to
these constraints, in an iterative fashion. One procedure is the following:

1. Compute initial estimates of the specific variances $\psi_1, \psi_2, \ldots, \psi_p$. Jöreskog [8]
suggests setting

$$\hat{\psi}_i = \left(1 - \frac{1}{2}\cdot\frac{m}{p}\right)\left(\frac{1}{s^{ii}}\right) \tag{9A-4}$$

where $s^{ii}$ is the ith diagonal element of $S^{-1}$.




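2. Given $\hat{\Psi}$, compute the first m distinct eigenvalues, $\hat{\lambda}_1 > \hat{\lambda}_2 > \cdots > \hat{\lambda}_m > 1$, and
corresponding eigenvectors, $\hat{e}_1, \hat{e}_2, \ldots, \hat{e}_m$, of the "uniqueness-rescaled"
covariance matrix

$$S^* = \hat{\Psi}^{-1/2}S_n\hat{\Psi}^{-1/2} \tag{9A-5}$$

Let $\hat{E} = [\hat{e}_1 \mid \hat{e}_2 \mid \cdots \mid \hat{e}_m]$ be the p × m matrix of normalized eigenvectors
and $\hat{\Lambda} = \mathrm{diag}[\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_m]$ be the m × m diagonal matrix of eigenvalues.
From (9A-1), $\hat{\Lambda} = I + \hat{\Delta}$ and $\hat{E} = \hat{\Psi}^{-1/2}\hat{L}\hat{\Delta}^{-1/2}$. Thus, we obtain the estimates

$$\hat{L} = \hat{\Psi}^{1/2}\hat{E}\hat{\Delta}^{1/2} = \hat{\Psi}^{1/2}\hat{E}(\hat{\Lambda} - I)^{1/2} \tag{9A-6}$$

3. Substitute $\hat{L}$ obtained in (9A-6) into the likelihood function (9A-3), and
minimize the result with respect to $\hat{\psi}_1, \hat{\psi}_2, \ldots, \hat{\psi}_p$. A numerical search routine
must be used. The values $\hat{\psi}_1, \hat{\psi}_2, \ldots, \hat{\psi}_p$ obtained from this minimization are
employed at Step (2) to create a new $\hat{L}$. Steps (2) and (3) are repeated until
convergence, that is, until the differences between successive values of $\hat{\ell}_{ij}$ and $\hat{\psi}_i$
are negligible.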
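Steps 1 and 2 of the scheme above can be sketched in code as follows, assuming NumPy. The function names are hypothetical, the starting-value formula is written in the form $(1 - m/2p)/s^{ii}$ (treat that form as an assumption of this sketch), and the numerical search of Step 3 is deliberately left out.

```python
import numpy as np

def initial_specific_variances(S, m):
    """Step 1 starting values, assumed form: psi_i = (1 - m/(2p)) / s^{ii}."""
    p = S.shape[0]
    return (1.0 - 0.5 * m / p) / np.diag(np.linalg.inv(S))

def loadings_given_psi(S_n, psi, m):
    """Step 2: loadings from the eigen-decomposition of Psi^{-1/2} S_n Psi^{-1/2}."""
    scale = 1.0 / np.sqrt(psi)
    S_star = S_n * np.outer(scale, scale)          # uniqueness-rescaled matrix, (9A-5)
    eigvals, eigvecs = np.linalg.eigh(S_star)      # eigenvalues in ascending order
    lam = eigvals[::-1][:m]
    E = eigvecs[:, ::-1][:, :m]
    delta = np.clip(lam - 1.0, 0.0, None)          # Lambda - I, guarded against values below 1
    return (E * np.sqrt(delta)) * np.sqrt(psi)[:, None]   # Psi^{1/2} E (Lambda - I)^{1/2}, (9A-6)

# Hypothetical two-factor structure; S_n is built to satisfy the model exactly.
L_true = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, -0.3], [0.2, 0.8], [0.1, 0.7]])
psi_true = np.array([0.3, 0.4, 0.5, 0.3, 0.4])
S_n = L_true @ L_true.T + np.diag(psi_true)

psi_start = initial_specific_variances(S_n, m=2)
L_hat = loadings_given_psi(S_n, psi_true, m=2)
Delta_hat = L_hat.T @ np.diag(1.0 / psi_true) @ L_hat     # should be diagonal
```

When the true specific variances are supplied, `L_hat @ L_hat.T` reproduces `L_true @ L_true.T`, although `L_hat` itself may differ from `L_true` by an orthogonal rotation, because the uniqueness condition selects one particular rotation.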


Comment. It often happens that the objective function in (9A-3) has a relative
minimum corresponding to negative values for some $\hat{\psi}_i$. This solution is clearly
inadmissible and is said to be improper, or a Heywood case. For most packaged
computer programs, negative $\hat{\psi}_i$, if they occur on a particular iteration, are changed
to small positive numbers before proceeding with the next step.


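Maximum Likelihood Estimators of $\rho = L_zL_z' + \Psi_z$


When $\Sigma$ has the factor analysis structure $\Sigma = LL' + \Psi$, $\rho$ can be factored as
$\rho = V^{-1/2}\Sigma V^{-1/2} = (V^{-1/2}L)(V^{-1/2}L)' + V^{-1/2}\Psi V^{-1/2} = L_zL_z' + \Psi_z$. The loading
matrix for the standardized variables is $L_z = V^{-1/2}L$, and the corresponding specific
variance matrix is $\Psi_z = V^{-1/2}\Psi V^{-1/2}$, where $V^{-1/2}$ is the diagonal matrix with ith
diagonal element $\sigma_{ii}^{-1/2}$. If $R$ is substituted for $S_n$ in the objective function of (9A-3),
the investigator minimizes

$$\ln\left(\frac{|L_zL_z' + \Psi_z|}{|R|}\right) + \mathrm{tr}[(L_zL_z' + \Psi_z)^{-1}R] - p \tag{9A-7}$$

Introducing the diagonal matrix $\hat{V}^{1/2}$, whose ith diagonal element is the square
root of the ith diagonal element of $S_n$, we can write the objective function in (9A-7) as

$$\ln\left(\frac{|\hat{V}^{1/2}|\,|L_zL_z' + \Psi_z|\,|\hat{V}^{1/2}|}{|\hat{V}^{1/2}|\,|R|\,|\hat{V}^{1/2}|}\right) + \mathrm{tr}[(L_zL_z' + \Psi_z)^{-1}\hat{V}^{-1/2}\hat{V}^{1/2}R\hat{V}^{1/2}\hat{V}^{-1/2}] - p$$

$$= \ln\left(\frac{|(\hat{V}^{1/2}L_z)(\hat{V}^{1/2}L_z)' + \hat{V}^{1/2}\Psi_z\hat{V}^{1/2}|}{|S_n|}\right) + \mathrm{tr}[((\hat{V}^{1/2}L_z)(\hat{V}^{1/2}L_z)' + \hat{V}^{1/2}\Psi_z\hat{V}^{1/2})^{-1}S_n] - p$$

$$\ge \ln\left(\frac{|\hat{L}\hat{L}' + \hat{\Psi}|}{|S_n|}\right) + \mathrm{tr}[(\hat{L}\hat{L}' + \hat{\Psi})^{-1}S_n] - p \tag{9A-8}$$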


The last inequality follows because the maximum likelihood estimates $\hat{L}$ and $\hat{\Psi}$
minimize the objective function (9A-3). [Equality holds in (9A-8) for $\hat{L}_z = \hat{V}^{-1/2}\hat{L}$
and $\hat{\Psi}_z = \hat{V}^{-1/2}\hat{\Psi}\hat{V}^{-1/2}$.] Therefore, minimizing (9A-7) over $L_z$ and $\Psi_z$ is equivalent
to obtaining $\hat{L}$ and $\hat{\Psi}$ from $S_n$ and estimating $L_z = V^{-1/2}L$ by $\hat{L}_z = \hat{V}^{-1/2}\hat{L}$ and
$\Psi_z = V^{-1/2}\Psi V^{-1/2}$ by $\hat{\Psi}_z = \hat{V}^{-1/2}\hat{\Psi}\hat{V}^{-1/2}$. The rationale for the latter procedure
comes from the invariance property of maximum likelihood estimators. [See (4-20).]


Exercises


9.1. Show that the covariance matrix

$$\rho = \begin{bmatrix} 1.0 & .63 & .45 \\ .63 & 1.0 & .35 \\ .45 & .35 & 1.0 \end{bmatrix}$$

for the p = 3 standardized random variables $Z_1$, $Z_2$, and $Z_3$ can be generated by the
m = 1 factor model

$$Z_1 = .9F_1 + \varepsilon_1$$
$$Z_2 = .7F_1 + \varepsilon_2$$
$$Z_3 = .5F_1 + \varepsilon_3$$

where $\mathrm{Var}(F_1) = 1$, $\mathrm{Cov}(\varepsilon, F_1) = 0$, and

$$\Psi = \mathrm{Cov}(\varepsilon) = \begin{bmatrix} .19 & 0 & 0 \\ 0 & .51 & 0 \\ 0 & 0 & .75 \end{bmatrix}$$

That is, write $\rho$ in the form $\rho = LL' + \Psi$.

9.2. Use the information in Exercise 9.1.
(a) Calculate communalities $h_i^2$, i = 1, 2, 3, and interpret these quantities.
(b) Calculate $\mathrm{Corr}(Z_i, F_1)$ for i = 1, 2, 3. Which variable might carry the greatest
weight in "naming" the common factor? Why?

9.3. The eigenvalues and eigenvectors of the correlation matrix $\rho$ in Exercise 9.1 are

$\lambda_1 = 1.96$, $e_1' = [.625, .593, .507]$
$\lambda_2 = .68$, $e_2' = [-.219, -.491, .843]$
$\lambda_3 = .36$, $e_3' = [.749, -.638, -.177]$

(a) Assuming an m = 1 factor model, calculate the loading matrix $L$ and matrix of
specific variances $\Psi$ using the principal component solution method. Compare the
results with those in Exercise 9.1.
(b) What proportion of the total population variance is explained by the first common factor?

9.4. Given $\rho$ and $\Psi$ in Exercise 9.1 and an m = 1 factor model, calculate the reduced
correlation matrix $\tilde{\rho} = \rho - \Psi$ and the principal factor solution for the loading matrix $L$.
Is the result consistent with the information in Exercise 9.1? Should it be?

9.5. Establish the inequality (9-19).
Hint: Since $S - \tilde{L}\tilde{L}' - \tilde{\Psi}$ has zeros on the diagonal,

(sum of squared entries of $S - \tilde{L}\tilde{L}' - \tilde{\Psi}$) $\le$ (sum of squared entries of $S - \tilde{L}\tilde{L}'$)

Now, $S - \tilde{L}\tilde{L}' = \hat{\lambda}_{m+1}\hat{e}_{m+1}\hat{e}_{m+1}' + \cdots + \hat{\lambda}_p\hat{e}_p\hat{e}_p' = \hat{P}_{(2)}\hat{\Lambda}_{(2)}\hat{P}_{(2)}'$, where $\hat{P}_{(2)} = [\hat{e}_{m+1} \mid \cdots \mid \hat{e}_p]$
and $\hat{\Lambda}_{(2)}$ is the diagonal matrix with elements $\hat{\lambda}_{m+1}, \ldots, \hat{\lambda}_p$.
Use (sum of squared entries of $A$) = $\mathrm{tr}\,AA'$ and $\mathrm{tr}[\hat{P}_{(2)}\hat{\Lambda}_{(2)}\hat{\Lambda}_{(2)}\hat{P}_{(2)}'] = \mathrm{tr}[\hat{\Lambda}_{(2)}\hat{\Lambda}_{(2)}]$.

9.6. Verify the following matrix identities.
(a) $(I + L'\Psi^{-1}L)^{-1}L'\Psi^{-1}L = I - (I + L'\Psi^{-1}L)^{-1}$
Hint: Premultiply both sides by $(I + L'\Psi^{-1}L)$.
(b) $(LL' + \Psi)^{-1} = \Psi^{-1} - \Psi^{-1}L(I + L'\Psi^{-1}L)^{-1}L'\Psi^{-1}$
Hint: Postmultiply both sides by $(LL' + \Psi)$ and use (a).
(c) $L'(LL' + \Psi)^{-1} = (I + L'\Psi^{-1}L)^{-1}L'\Psi^{-1}$
Hint: Postmultiply the result in (b) by $L$, use (a), and take the transpose, noting that
$(LL' + \Psi)^{-1}$, $\Psi^{-1}$, and $(I + L'\Psi^{-1}L)^{-1}$ are symmetric matrices.

9.7. (The factor model parameterization need not be unique.) Let the factor model with
p = 2 and m = 1 prevail. Show that

$$\sigma_{11} = \ell_{11}^2 + \psi_1, \qquad \sigma_{12} = \sigma_{21} = \ell_{11}\ell_{21}, \qquad \sigma_{22} = \ell_{21}^2 + \psi_2$$

and, for given $\sigma_{11}$, $\sigma_{22}$, and $\sigma_{12}$, there is an infinity of choices for $L$ and $\Psi$.

9.8. (Unique but improper solution: Heywood case.)
Consider an m = 1 factor model for the population with covariance matrix

$$\Sigma = \begin{bmatrix} 1 & .4 & .9 \\ .4 & 1 & .7 \\ .9 & .7 & 1 \end{bmatrix}$$

Show that there is a unique choice of $L$ and $\Psi$ with $\Sigma = LL' + \Psi$, but that $\psi_3 < 0$, so
the choice is not admissible.


9.9. In a study of liquor preference in France, Stoetzel [14] collected preference rankings of
p = 9 liquor types from n = 1442 individuals. A factor analysis of the 9 × 9 sample
correlation matrix of rank orderings gave the following estimated loadings:

                  Estimated factor loadings
Variable ($X_i$)     F1      F2      F3
Liquors             .64     .02     .16
Kirsch              .50    -.06    -.10
Mirabelle           .46    -.24     .19
Rum                 .17     .74     .97*
Marc               -.29     .66    -.39
Whiskey            -.29    -.08     .09
Calvados           -.49     .20    -.04
Cognac             -.52    -.03     .42
Armagnac           -.60    -.17     .14

*This figure is too high. It exceeds the maximum value of .64, as a result
of an approximation method for obtaining the estimated factor loadings
used by Stoetzel.

Given these results, Stoetzel concluded the following: The major principle of liquor
preference in France is the distinction between sweet and strong liquors. The second
motivating element is price, which can be understood by remembering that liquor is both an
expensive commodity and an item of conspicuous consumption. Except in the case of
the two most popular and least expensive items (rum and marc), this second factor plays
a much smaller role in producing preference judgments. The third factor concerns the
sociological and primarily the regional, variability of the judgments. (See [14], p. 11.)

(a) Given what you know about the various liquors involved, does Stoetzel's
interpretation seem reasonable?
(b) Plot the loading pairs for the first two factors. Conduct a graphical orthogonal
rotation of the factor axes. Generate approximate rotated loadings. Interpret the rotated
loadings for the first two factors. Does your interpretation agree with Stoetzel's
interpretation of these factors from the unrotated loadings? Explain.

9.10. The correlation matrix for chicken-bone measurements (see Example 9.14) is

$$\begin{bmatrix} 1.000 & & & & & \\ .505 & 1.000 & & & & \\ .569 & .422 & 1.000 & & & \\ .602 & .467 & .926 & 1.000 & & \\ .621 & .482 & .877 & .874 & 1.000 & \\ .603 & .450 & .878 & .894 & .937 & 1.000 \end{bmatrix}$$

The following estimated factor loadings were extracted by the maximum likelihood
procedure:

                        Estimated            Varimax rotated
                        factor loadings      estimated factor loadings
Variable                 F1       F2           F1*      F2*
1. Skull length         .602     .200         .484     .411
2. Skull breadth        .467     .154         .375     .319
3. Femur length         .926     .143         .603     .717
4. Tibia length        1.000     .000         .519     .855
5. Humerus length       .874     .476         .861     .499
6. Ulna length          .894     .327         .744     .594

Using the unrotated estimated factor loadings, obtain the maximum likelihood estimates
of the following.
(a) The specific variances.
(b) The communalities.
(c) The proportion of variance explained by each factor.
(d) The residual matrix $R - \hat{L}_z\hat{L}_z' - \hat{\Psi}_z$.

9.11. Refer to Exercise 9.10. Compute the value of the varimax criterion using both unrotated
and rotated estimated factor loadings. Comment on the results.

9.12. The covariance matrix for the logarithms of turtle measurements (see Example 8.4) is

$$S = 10^{-3}\begin{bmatrix} 11.072 & & \\ 8.019 & 6.417 & \\ 8.160 & 6.005 & 6.773 \end{bmatrix}$$

The following maximum likelihood estimates of the factor loadings for an m = 1 model
were obtained:

                   Estimated factor
                   loadings
Variable            F1
1. ln(length)      .1022
2. ln(width)       .0752
3. ln(height)      .0765

Using the estimated factor loadings, obtain the maximum likelihood estimates of each of
the following.
(a) Specific variances.
(b) Communalities.
(c) Proportion of variance explained by the factor.
(d) The residual matrix $S_n - \hat{L}\hat{L}' - \hat{\Psi}$.
Hint: Convert $S$ to $S_n$.

9.13. Refer to Exercise 9.12. Compute the test statistic in (9-39). Indicate why a test of
$H_0\colon \Sigma = LL' + \Psi$ (with m = 1) versus $H_1\colon \Sigma$ unrestricted cannot be carried out for
this example. [See (9-40).]


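9.14. The maximum likelihood factor loading estimates are given in (9A-6) by

$$\hat{L} = \hat{\Psi}^{1/2}\hat{E}\hat{\Delta}^{1/2}$$

Verify, for this choice, that

$$\hat{L}'\hat{\Psi}^{-1}\hat{L} = \hat{\Delta}$$

where $\hat{\Delta} = \hat{\Lambda} - I$ is a diagonal matrix.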


9.15. Hirschey and Wichern [7] investigate the consistency, determinants, and uses of
accounting and market-value measures of profitability. As part of their study, a factor
analysis of accounting profit measures and market estimates of economic profits was
conducted. The correlation matrix of accounting historical, accounting replacement,
and market-value measures of profitability for a sample of firms operating in 1977 is as
follows:

Variable                                  HRA    HRE    HRS    RRA    RRE    RRS     Q     REV
Historical return on assets, HRA         1.000
Historical return on equity, HRE          .738  1.000
Historical return on sales, HRS           .731   .520  1.000
Replacement return on assets, RRA         .828   .688   .652  1.000
Replacement return on equity, RRE         .681   .831   .513   .887  1.000
Replacement return on sales, RRS          .712   .543   .826   .867   .692  1.000
Market Q ratio, Q                         .625   .322   .579   .639   .419   .608  1.000
Market relative excess value, REV         .604   .303   .617   .563   .352   .610   .937  1.000

The following rotated principal component estimates of factor loadings for an m = 3
factor model were obtained:

                                      Estimated factor loadings
Variable                                F1      F2      F3
Historical return on assets            .433    .612    .499
Historical return on equity            .125    .892    .234
Historical return on sales             .296    .238    .887
Replacement return on assets           .406    .708    .483
Replacement return on equity           .198    .895    .283
Replacement return on sales            .331    .414    .789
Market Q ratio                         .928    .160    .294
Market relative excess value           .910    .079    .355

Cumulative proportion
of total variance explained            .287    .628    .908

(a) Using the estimated factor loadings, determine the specific variances and communalities.

(b) Determine the residual matrix, $\mathbf{R} - \hat{\mathbf{L}}_z\hat{\mathbf{L}}_z' - \hat{\boldsymbol{\Psi}}_z$. Given this information and the
cumulative proportion of total variance explained in the preceding table, does an
m = 3 factor model appear appropriate for these data?

(c) Assuming that estimated loadings less than .4 are small, interpret the three factors.
Does it appear, for example, that market-value measures provide evidence of
profitability distinct from that provided by accounting measures? Can you separate
accounting historical measures of profitability from accounting replacement
measures?
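
Part (a) rests on two identities for a factor model fitted to a correlation matrix: the communality of a variable is the sum of its squared loadings, and its specific variance is one minus that sum. The following is a minimal NumPy sketch of that arithmetic, assuming the loadings are transcribed from the table in this exercise; the array and variable names are illustrative and do not come from the text.

```python
import numpy as np

# Rotated principal component loadings for the m = 3 model, as read from
# the table in Exercise 9.15 (rows: HRA, HRE, HRS, RRA, RRE, RRS, Q, REV).
loadings = np.array([
    [.433, .612, .499],
    [.125, .892, .234],
    [.296, .238, .887],
    [.406, .708, .483],
    [.198, .895, .283],
    [.331, .414, .789],
    [.928, .160, .294],
    [.910, .079, .355],
])

# Communality h_i^2: sum of squared loadings across the factors.
communalities = (loadings ** 2).sum(axis=1)

# For standardized variables the specific variance is 1 - h_i^2.
specific_variances = 1.0 - communalities

# Proportion of total standardized variance due to each factor,
# accumulated over factors (p = 8 variables).
cumulative = np.cumsum((loadings ** 2).sum(axis=0)) / loadings.shape[0]
print(np.round(cumulative, 3))
```

The cumulative proportions computed this way should agree with the last row of the loadings table, which gives a quick check that the loadings were transcribed correctly.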


9.16. Verify that factor scores constructed according to (9-50) have sample mean vector $\mathbf{0}$ and
zero sample covariances.


9.17. Refer to Example 9.12. Using the information in this example, evaluate $(\hat{\mathbf{L}}_z'\hat{\boldsymbol{\Psi}}_z^{-1}\hat{\mathbf{L}}_z)^{-1}$.
Note: Set the fourth diagonal element of $\hat{\boldsymbol{\Psi}}_z$ to .01 so that $\hat{\boldsymbol{\Psi}}_z^{-1}$ can be determined.
Will the regression and generalized least squares methods for constructing factor scores
for standardized stock price observations give nearly the same results? Hint: See equation
(9-57) and the discussion following it.


The following exercises require the use of a computer. 


9.18. Refer to Exercise 8.16 concerning the numbers of fish caught.
(a) Using only the measurements $x_1$–$x_4$, obtain the principal component solution for
factor models with m = 1 and m = 2.
(b) Using only the measurements $x_1$–$x_4$, obtain the maximum likelihood solution for
factor models with m = 1 and m = 2.
(c) Rotate your solutions in Parts (a) and (b). Compare the solutions and comment on
them. Interpret each factor.
(d) Perform a factor analysis using the measurements $x_1$–$x_6$. Determine a reasonable
number of factors m, and compare the principal component and maximum likelihood
solutions after rotation. Interpret the factors.


9.19. A firm is attempting to evaluate the quality of its sales staff and is trying to find an
examination or series of tests that may reveal the potential for good performance in sales.
The firm has selected a random sample of 50 sales people and has evaluated each on 3
measures of performance: growth of sales, profitability of sales, and new-account sales.
These measures have been converted to a scale, on which 100 indicates "average"
performance. Each of the 50 individuals took each of 4 tests, which purported to measure
creativity, mechanical reasoning, abstract reasoning, and mathematical ability,
respectively. The n = 50 observations on p = 7 variables are listed in Table 9.12 on page 536.

(a) Assume an orthogonal factor model for the standardized variables
$Z_i = (X_i - \mu_i)/\sqrt{\sigma_{ii}}$, $i = 1, 2, \ldots, 7$. Obtain either the principal component solution or
the maximum likelihood solution for m = 2 and m = 3 common factors.

(b) Given your solution in (a), obtain the rotated loadings for m = 2 and m = 3.
Compare the two sets of rotated loadings. Interpret the m = 2 and m = 3 factor solutions.

(c) List the estimated communalities, specific variances, and $\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}$ for the m = 2
and m = 3 solutions. Compare the results. Which choice of m do you prefer at this
point? Why?

(d) Conduct a test of $H_0\colon \boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$ versus $H_1\colon \boldsymbol{\Sigma} \neq \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$ for both m = 2 and
m = 3 at the $\alpha = .01$ level. With these results and those in Parts b and c, which
choice of m appears to be the best?

(e) Suppose a new salesperson, selected at random, obtains the test scores
$\mathbf{x}' = [x_1, x_2, \ldots, x_7] = [110, 98, 105, 15, 18, 12, 35]$. Calculate the salesperson's factor
score using the weighted least squares method and the regression method.

Note: The components of $\mathbf{x}$ must be standardized using the sample means and
variances calculated from the original data.


9.20. Using the air-pollution variables $X_1$, $X_2$, $X_5$, and $X_6$ given in Table 1.5, generate the
sample covariance matrix.

(a) Obtain the principal component solution to a factor model with m = 1 and m = 2.

(b) Find the maximum likelihood estimates of $\mathbf{L}$ and $\boldsymbol{\Psi}$ for m = 1 and m = 2.

(c) Compare the factorization obtained by the principal component and maximum
likelihood methods.


9.21. Perform a varimax rotation of both m = 2 solutions in Exercise 9.20. Interpret the
results. Are the principal component and maximum likelihood solutions consistent with
each other?


9.22. Refer to Exercise 9.20.

(a) Calculate the factor scores from the m = 2 maximum likelihood estimates by
(i) weighted least squares in (9-50) and (ii) the regression approach of (9-58).

(b) Find the factor scores from the principal component solution, using (9-51).

(c) Compare the three sets of factor scores.


9.23. Repeat Exercise 9.20, starting from the sample correlation matrix. Interpret the factors
for the m = 1 and m = 2 solutions. Does it make a difference if $\mathbf{R}$, rather than $\mathbf{S}$, is
factored? Explain.


9.24. Perform a factor analysis of the census-tract data in Table 8.5. Start with $\mathbf{R}$ and obtain
both the maximum likelihood and principal component solutions. Comment on your
choice of m. Your analysis should include factor rotation and the computation of factor
scores.


9.25. Perform a factor analysis of the "stiffness" measurements given in Table 4.3 and
discussed in Example 4.14. Compute factor scores, and check for outliers in the data. Use
the sample covariance matrix $\mathbf{S}$.




Table 9.12 Salespeople Data

              Index of:
              Sales      Sales
Salesperson   growth     profitability
              ($x_1$)    ($x_2$)
1 93.0 96.0 
2 88.8 91.8 
3 95.0 100.3 
4 101.3 103.8 
5 102.0 107.8 
6 95.8 97.5 
7 95.5 99.5 
8 110.8 122.0 
9 102.8 108.3 
10 106.8 120.5 
11 103.3 109.8 
12 99.5 111.8 
13 103.5 112.5 
14 99.5 105.5
15 100.0 107.0 
16 81.5 93.5 
17 101.3 105.3 
18 103.3 110.8 
19 95.3 104.3 
20 99.5 105.3 
21 88.5 95.3 
22 99.3 115.0 
23 87.5 92.5 
24 105.3 114.0 
25 107.0 121.0 
26 93.3 102.0 
27 106.8 118.0 
28 106.8 120.0 
29 92.3 90.8 
30 106.3 121.0 
31 106.0 119.5 
32 88.3 92.8 
33 96.0 103.3 
34 94.3 94.5 
35 106.5 121.5 
36 106.5 115.5 
37 92.0 99.5 
38 102.0 99.8 
39 108.3 122.3 
40 106.8 119.0 
41 102.5 109.3 
42 92.5 102.5 
43 102.8 113.8 
44 83.3 87.3 
45 94.8 101.8 
46 103.5 112.0 
47 89.5 96.0 
48 84.3 89.8 
49 104.3 109.5 
50 106.0 118.5 


[Remaining column headings of Table 9.12: index of new-account sales; scores on the creativity test, mechanical reasoning test, abstract reasoning test, and mathematics test.]


9.26. Consider the mice-weight data in Example 8.6. Start with the sample covariance matrix.
(See Exercise 8.15 for $\sqrt{s_{ii}}$.)

(a) Obtain the principal component solution to the factor model with m = 1 and
m = 2.

(b) Find the maximum likelihood estimates of the loadings and specific variances for
m = 1 and m = 2.

(c) Perform a varimax rotation of the solutions in Parts a and b.


9.27. Repeat Exercise 9.26 by factoring $\mathbf{R}$ instead of the sample covariance matrix $\mathbf{S}$. Also, for
the mouse with standardized weights [.8, −.2, −.6, 1.5], obtain the factor scores using
the maximum likelihood estimates of the loadings and Equation (9-58).


9.28. Perform a factor analysis of the national track records for women given in Table 1.9. Use
the sample covariance matrix $\mathbf{S}$ and interpret the factors. Compute factor scores, and
check for outliers in the data. Repeat the analysis with the sample correlation matrix $\mathbf{R}$.
Does it make a difference if $\mathbf{R}$, rather than $\mathbf{S}$, is factored? Explain.


9.29. Refer to Exercise 9.28. Convert the national track records for women to speeds
measured in meters per second. (See Exercise 8.19.) Perform a factor analysis of the speed
data. Use the sample covariance matrix $\mathbf{S}$ and interpret the factors. Compute factor
scores, and check for outliers in the data. Repeat the analysis with the sample correlation
matrix $\mathbf{R}$. Does it make a difference if $\mathbf{R}$, rather than $\mathbf{S}$, is factored? Explain. Compare
your results with the results in Exercise 9.28. Which analysis do you prefer? Why?


9.30. Perform a factor analysis of the national track records for men given in Table 8.6. Repeat
the steps given in Exercise 9.28. Is the appropriate factor model for the men's data
different from the one for the women's data? If not, are the interpretations of the factors
roughly the same? If the models are different, explain the differences.


9.31. Refer to Exercise 9.30. Convert the national track records for men to speeds measured
in meters per second. (See Exercise 8.21.) Perform a factor analysis of the speed data.
Use the sample covariance matrix $\mathbf{S}$ and interpret the factors. Compute factor scores,
and check for outliers in the data. Repeat the analysis with the sample correlation matrix
$\mathbf{R}$. Does it make a difference if $\mathbf{R}$, rather than $\mathbf{S}$, is factored? Explain. Compare your
results with the results in Exercise 9.30. Which analysis do you prefer? Why?


9.32. Perform a factor analysis of the data on bulls given in Table 1.10. Use the seven variables
YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt. Factor the sample
covariance matrix $\mathbf{S}$ and interpret the factors. Compute factor scores, and check for outliers.
Repeat the analysis with the sample correlation matrix $\mathbf{R}$. Compare the results obtained
from $\mathbf{S}$ with the results from $\mathbf{R}$. Does it make a difference if $\mathbf{R}$, rather than $\mathbf{S}$, is factored?
Explain.


9.33. Perform a factor analysis of the psychological profile data in Table 4.6. Use the sample
correlation matrix $\mathbf{R}$ constructed from measurements on the five variables, Indep, Supp,
Benev, Conform and Leader. Obtain both the principal component and maximum
likelihood solutions for m = 2 and m = 3 factors. Can you interpret the factors? Your
analysis should include factor rotation and the computation of factor scores.

Note: Be aware that a maximum likelihood solution may result in a Heywood case.


9.34. The pulp and paper properties data are given in Table 7.7. Perform a factor analysis
using observations on the four paper property variables, BL, EM, SF, and BS and the
sample correlation matrix $\mathbf{R}$. Can the information in these data be summarized by a
single factor? If so, can you interpret the factor? Try both the principal component and
maximum likelihood solution methods. Repeat this analysis with the sample covariance
matrix $\mathbf{S}$. Does your interpretation of the factor(s) change if $\mathbf{S}$ rather than $\mathbf{R}$ is
factored?


9.35. Repeat Exercise 9.34 using observations on the pulp fiber characteristic variables AFL,
LFF, FFF, and ZST. Can these data be summarized by a single factor? Explain.


9.36. Factor analyze the Mali family farm data in Table 8.7. Use the sample correlation matrix
$\mathbf{R}$. Try both the principal component and maximum likelihood solution methods for
m = 3, 4, and 5 factors. Can you interpret the factors? Justify your choice of m. Your
analysis should include factor rotation and the computation of factor scores. Can you
identify any outliers in these data?


References


1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York: John Wiley, 2003.

2. Bartlett, M. S. "The Statistical Conception of Mental Factors." British Journal of Psychology, 28 (1937), 97-104.

3. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.

4. Dixon, W. S. Statistical Software Manual to Accompany BMDP Release 7/version 7.0 (paperback). Berkeley, CA: University of California Press, 1992.

5. Dunn, L. C. "The Effect of Inbreeding on the Bones of the Fowl." Storrs Agricultural Experimental Station Bulletin, 82 (1928), 1-112.

6. Harmon, H. H. Modern Factor Analysis (3rd ed.). Chicago: The University of Chicago Press, 1976.

7. Hirschey, M., and D. W. Wichern. "Accounting and Market-Value Measures of Profitability: Consistency, Determinants and Uses." Journal of Business and Economic Statistics, 2, no. 4 (1984), 375-383.

8. Joreskog, K. G. "Factor Analysis by Least Squares and Maximum Likelihood." In Statistical Methods for Digital Computers, edited by K. Enslein, A. Ralston, and H. S. Wilf. New York: John Wiley, 1975.

9. Kaiser, H. F. "The Varimax Criterion for Analytic Rotation in Factor Analysis." Psychometrika, 23 (1958), 187-200.

10. Lawley, D. N., and A. E. Maxwell. Factor Analysis as a Statistical Method (2nd ed.). New York: American Elsevier Publishing Co., 1971.

11. Linden, M. "A Factor Analytic Study of Olympic Decathlon Data." Research Quarterly, 48, no. 3 (1977), 562-568.

12. Maxwell, A. E. Multivariate Analysis in Behavioral Research. London: Chapman and Hall, 1977.

13. Morrison, D. F. Multivariate Statistical Methods (4th ed.). Belmont, CA: Brooks/Cole Thompson Learning, 2005.

14. Stoetzel, J. "A Factor Analysis of Liquor Preference." Journal of Advertising Research, 1 (1960), 7-11.

15. Wright, S. "The Interpretation of Multivariate Systems." In Statistics and Mathematics in Biology, edited by O. Kempthorne and others. Ames, IA: Iowa State University Press, 1954, 11-33.


CANONICAL CORRELATION ANALYSIS


10.1 Introduction


Canonical correlation analysis seeks to identify and quantify the associations 
between two sets of variables. H. Hotelling ([5], [6]), who initially developed
the technique, provided the example of relating arithmetic speed and arithmetic 
power to reading speed and reading power. (See Exercise 10.9.) Other examples 
include relating governmental policy variables with economic goal variables and 
relating college “performance” variables with precollege “achievement” variables. 

Canonical correlation analysis focuses on the correlation between a linear 
combination of the variables in one set and a linear combination of the variables in 
another set. The idea is first to determine the pair of linear combinations having 
the largest correlation. Next, we determine the pair of linear combinations having 
the largest correlation among all pairs uncorrelated with the initially selected pair, 
and so on. The pairs of linear combinations are called the canonical variables, and 
their correlations are called canonical correlations. 

The canonical correlations measure the strength of association between the two 
sets of variables. The maximization aspect of the technique represents an attempt to 
concentrate a high-dimensional relationship between two sets of variables into a 
few pairs of canonical variables. 


10.2 Canonical Variates and Canonical Correlations 


We shall be interested in measures of association between two groups of variables.
The first group, of p variables, is represented by the $(p \times 1)$ random vector $\mathbf{X}^{(1)}$. The
second group, of q variables, is represented by the $(q \times 1)$ random vector $\mathbf{X}^{(2)}$. We
assume, in the theoretical development, that $\mathbf{X}^{(1)}$ represents the smaller set, so that
$p \le q$.




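For the random vectors $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$, let

$$
\begin{aligned}
E(\mathbf{X}^{(1)}) &= \boldsymbol{\mu}^{(1)}; & \mathrm{Cov}(\mathbf{X}^{(1)}) &= \boldsymbol{\Sigma}_{11} \\
E(\mathbf{X}^{(2)}) &= \boldsymbol{\mu}^{(2)}; & \mathrm{Cov}(\mathbf{X}^{(2)}) &= \boldsymbol{\Sigma}_{22} \\
\mathrm{Cov}(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}) &= \boldsymbol{\Sigma}_{12} = \boldsymbol{\Sigma}_{21}'
\end{aligned}
\tag{10-1}
$$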


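It will be convenient to consider $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ jointly, so, using results (2-38)
through (2-40) and (10-1), we find that the random vector

$$
\underset{((p+q)\times 1)}{\mathbf{X}} =
\begin{bmatrix} \mathbf{X}^{(1)} \\ \mathbf{X}^{(2)} \end{bmatrix} =
\begin{bmatrix} X_1^{(1)} \\ X_2^{(1)} \\ \vdots \\ X_p^{(1)} \\ X_1^{(2)} \\ X_2^{(2)} \\ \vdots \\ X_q^{(2)} \end{bmatrix}
\tag{10-2}
$$

has mean vector

$$
\underset{((p+q)\times 1)}{\boldsymbol{\mu}} = E(\mathbf{X}) =
\begin{bmatrix} E(\mathbf{X}^{(1)}) \\ E(\mathbf{X}^{(2)}) \end{bmatrix} =
\begin{bmatrix} \boldsymbol{\mu}^{(1)} \\ \boldsymbol{\mu}^{(2)} \end{bmatrix}
\tag{10-3}
$$

and covariance matrix

$$
\begin{aligned}
\underset{((p+q)\times(p+q))}{\boldsymbol{\Sigma}}
&= E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' \\
&= \begin{bmatrix}
E(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})' &
E(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})' \\
E(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})' &
E(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})'
\end{bmatrix} \\
&= \begin{bmatrix}
\underset{(p\times p)}{\boldsymbol{\Sigma}_{11}} & \underset{(p\times q)}{\boldsymbol{\Sigma}_{12}} \\
\underset{(q\times p)}{\boldsymbol{\Sigma}_{21}} & \underset{(q\times q)}{\boldsymbol{\Sigma}_{22}}
\end{bmatrix}
\end{aligned}
\tag{10-4}
$$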


The covariances between pairs of variables from different sets (one variable
from $\mathbf{X}^{(1)}$, one variable from $\mathbf{X}^{(2)}$) are contained in $\boldsymbol{\Sigma}_{12}$ or, equivalently, in $\boldsymbol{\Sigma}_{21}$.
That is, the pq elements of $\boldsymbol{\Sigma}_{12}$ measure the association between the two sets. When
p and q are relatively large, interpreting the elements of $\boldsymbol{\Sigma}_{12}$ collectively is ordinarily
hopeless. Moreover, it is often linear combinations of variables that are interesting
and useful for predictive or comparative purposes. The main task of canonical
correlation analysis is to summarize the associations between the $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ sets
in terms of a few carefully chosen covariances (or correlations) rather than the pq
covariances in $\boldsymbol{\Sigma}_{12}$.




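Linear combinations provide simple summary measures of a set of variables. Set

$$
\begin{aligned}
U &= \mathbf{a}'\mathbf{X}^{(1)} \\
V &= \mathbf{b}'\mathbf{X}^{(2)}
\end{aligned}
\tag{10-5}
$$

for some pair of coefficient vectors $\mathbf{a}$ and $\mathbf{b}$. Then, using (10-5) and (2-45), we obtain

$$
\begin{aligned}
\mathrm{Var}(U) &= \mathbf{a}'\,\mathrm{Cov}(\mathbf{X}^{(1)})\,\mathbf{a} = \mathbf{a}'\boldsymbol{\Sigma}_{11}\mathbf{a} \\
\mathrm{Var}(V) &= \mathbf{b}'\,\mathrm{Cov}(\mathbf{X}^{(2)})\,\mathbf{b} = \mathbf{b}'\boldsymbol{\Sigma}_{22}\mathbf{b} \\
\mathrm{Cov}(U, V) &= \mathbf{a}'\,\mathrm{Cov}(\mathbf{X}^{(1)}, \mathbf{X}^{(2)})\,\mathbf{b} = \mathbf{a}'\boldsymbol{\Sigma}_{12}\mathbf{b}
\end{aligned}
\tag{10-6}
$$

We shall seek coefficient vectors $\mathbf{a}$ and $\mathbf{b}$ such that

$$
\mathrm{Corr}(U, V) = \frac{\mathbf{a}'\boldsymbol{\Sigma}_{12}\mathbf{b}}{\sqrt{\mathbf{a}'\boldsymbol{\Sigma}_{11}\mathbf{a}}\,\sqrt{\mathbf{b}'\boldsymbol{\Sigma}_{22}\mathbf{b}}}
\tag{10-7}
$$

is as large as possible.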
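
Equations (10-6) and (10-7) translate directly into a few lines of code. The sketch below evaluates Corr(U, V) for one given pair of coefficient vectors; the function name `corr_uv`, the covariance blocks, and the vectors `a` and `b` are illustrative assumptions, not quantities taken from the text.

```python
import numpy as np

def corr_uv(a, b, s11, s12, s22):
    """Correlation (10-7) between U = a'X1 and V = b'X2."""
    cov_uv = a @ s12 @ b      # Cov(U, V) = a' Sigma12 b
    var_u = a @ s11 @ a       # Var(U)    = a' Sigma11 a
    var_v = b @ s22 @ b       # Var(V)    = b' Sigma22 b
    return cov_uv / np.sqrt(var_u * var_v)

# Hypothetical covariance blocks for two sets of two variables each.
s11 = np.array([[1.0, 0.4], [0.4, 1.0]])
s22 = np.array([[1.0, 0.2], [0.2, 1.0]])
s12 = np.array([[0.5, 0.6], [0.3, 0.4]])

a = np.array([3.0, 1.0])
b = np.array([1.0, 1.0])
r = corr_uv(a, b, s11, s12, s22)

# Rescaling a or b by a positive constant leaves the correlation unchanged,
# so some normalization (such as unit variance) is needed to pin them down.
r_scaled = corr_uv(10 * a, 0.5 * b, s11, s12, s22)
print(round(r, 4), round(r_scaled, 4))
```

Maximizing this function over all `a` and `b` is the optimization problem that the definitions and Result 10.1 address.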
We define the following: 


The first pair of canonical variables, or first canonical variate pair, is the pair of linear
combinations $U_1$, $V_1$ having unit variances, which maximize the correlation (10-7);
The second pair of canonical variables, or second canonical variate pair, is the pair
of linear combinations $U_2$, $V_2$ having unit variances, which maximize the
correlation (10-7) among all choices that are uncorrelated with the first pair of canonical
variables.


At the kth step,


The kth pair of canonical variables, or kth canonical variate pair, is the pair of
linear combinations $U_k$, $V_k$ having unit variances, which maximize the
correlation (10-7) among all choices uncorrelated with the previous $k - 1$ canonical
variable pairs.


The correlation between the kth pair of canonical variables is called the kth canonical
correlation.

The following result gives the necessary details for obtaining the canonical 
variables and their correlations. 


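Result 10.1. Suppose $p \le q$ and let the random vectors $\underset{(p\times 1)}{\mathbf{X}^{(1)}}$ and $\underset{(q\times 1)}{\mathbf{X}^{(2)}}$ have
$\mathrm{Cov}(\mathbf{X}^{(1)}) = \underset{(p\times p)}{\boldsymbol{\Sigma}_{11}}$, $\mathrm{Cov}(\mathbf{X}^{(2)}) = \underset{(q\times q)}{\boldsymbol{\Sigma}_{22}}$, and $\mathrm{Cov}(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}) = \underset{(p\times q)}{\boldsymbol{\Sigma}_{12}}$, where $\boldsymbol{\Sigma}$ has full
rank. For coefficient vectors $\underset{(p\times 1)}{\mathbf{a}}$ and $\underset{(q\times 1)}{\mathbf{b}}$, form the linear combinations $U = \mathbf{a}'\mathbf{X}^{(1)}$
and $V = \mathbf{b}'\mathbf{X}^{(2)}$. Then

$$\max_{\mathbf{a},\mathbf{b}} \mathrm{Corr}(U, V) = \rho_1^*$$

attained by the linear combinations (first canonical variate pair)

$$
U_1 = \underbrace{\mathbf{e}_1'\boldsymbol{\Sigma}_{11}^{-1/2}}_{\mathbf{a}_1'}\mathbf{X}^{(1)}
\quad\text{and}\quad
V_1 = \underbrace{\mathbf{f}_1'\boldsymbol{\Sigma}_{22}^{-1/2}}_{\mathbf{b}_1'}\mathbf{X}^{(2)}
$$

The kth pair of canonical variates, $k = 2, 3, \ldots, p$,

$$
U_k = \mathbf{e}_k'\boldsymbol{\Sigma}_{11}^{-1/2}\mathbf{X}^{(1)} \qquad V_k = \mathbf{f}_k'\boldsymbol{\Sigma}_{22}^{-1/2}\mathbf{X}^{(2)}
$$

maximizes

$$\mathrm{Corr}(U_k, V_k) = \rho_k^*$$

among those linear combinations uncorrelated with the preceding $1, 2, \ldots, k - 1$
canonical variables.

Here $\rho_1^{*2} \ge \rho_2^{*2} \ge \cdots \ge \rho_p^{*2}$ are the eigenvalues of $\boldsymbol{\Sigma}_{11}^{-1/2}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1/2}$, and
$\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$ are the associated $(p \times 1)$ eigenvectors. [The quantities $\rho_1^{*2}, \rho_2^{*2}, \ldots, \rho_p^{*2}$
are also the p largest eigenvalues of the matrix $\boldsymbol{\Sigma}_{22}^{-1/2}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1/2}$ with corresponding
$(q \times 1)$ eigenvectors $\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_p$. Each $\mathbf{f}_i$ is proportional to $\boldsymbol{\Sigma}_{22}^{-1/2}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1/2}\mathbf{e}_i$.]
The canonical variates have the properties

$$
\begin{aligned}
\mathrm{Var}(U_k) &= \mathrm{Var}(V_k) = 1 \\
\mathrm{Cov}(U_k, U_\ell) &= \mathrm{Corr}(U_k, U_\ell) = 0 \quad k \ne \ell \\
\mathrm{Cov}(V_k, V_\ell) &= \mathrm{Corr}(V_k, V_\ell) = 0 \quad k \ne \ell \\
\mathrm{Cov}(U_k, V_\ell) &= \mathrm{Corr}(U_k, V_\ell) = 0 \quad k \ne \ell
\end{aligned}
$$

for $k, \ell = 1, 2, \ldots, p$.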


Proof. (See website: www.prenhall.com/statistics) 


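If the original variables are standardized with $\mathbf{Z}^{(1)} = [Z_1^{(1)}, Z_2^{(1)}, \ldots, Z_p^{(1)}]'$ and
$\mathbf{Z}^{(2)} = [Z_1^{(2)}, Z_2^{(2)}, \ldots, Z_q^{(2)}]'$, from first principles, the canonical variates are of the form

$$
\begin{aligned}
U_k &= \mathbf{a}_k'\mathbf{Z}^{(1)} = \mathbf{e}_k'\boldsymbol{\rho}_{11}^{-1/2}\mathbf{Z}^{(1)} \\
V_k &= \mathbf{b}_k'\mathbf{Z}^{(2)} = \mathbf{f}_k'\boldsymbol{\rho}_{22}^{-1/2}\mathbf{Z}^{(2)}
\end{aligned}
\tag{10-8}
$$

Here, $\mathrm{Cov}(\mathbf{Z}^{(1)}) = \boldsymbol{\rho}_{11}$, $\mathrm{Cov}(\mathbf{Z}^{(2)}) = \boldsymbol{\rho}_{22}$, $\mathrm{Cov}(\mathbf{Z}^{(1)}, \mathbf{Z}^{(2)}) = \boldsymbol{\rho}_{12} = \boldsymbol{\rho}_{21}'$, and $\mathbf{e}_k$
and $\mathbf{f}_k$ are the eigenvectors of $\boldsymbol{\rho}_{11}^{-1/2}\boldsymbol{\rho}_{12}\boldsymbol{\rho}_{22}^{-1}\boldsymbol{\rho}_{21}\boldsymbol{\rho}_{11}^{-1/2}$ and $\boldsymbol{\rho}_{22}^{-1/2}\boldsymbol{\rho}_{21}\boldsymbol{\rho}_{11}^{-1}\boldsymbol{\rho}_{12}\boldsymbol{\rho}_{22}^{-1/2}$,
respectively. The canonical correlations, $\rho_k^*$, satisfy

$$\mathrm{Corr}(U_k, V_k) = \rho_k^*, \quad k = 1, 2, \ldots, p \tag{10-9}$$

where $\rho_1^{*2} \ge \rho_2^{*2} \ge \cdots \ge \rho_p^{*2}$ are the nonzero eigenvalues of the matrix
$\boldsymbol{\rho}_{11}^{-1/2}\boldsymbol{\rho}_{12}\boldsymbol{\rho}_{22}^{-1}\boldsymbol{\rho}_{21}\boldsymbol{\rho}_{11}^{-1/2}$ (or, equivalently, the largest eigenvalues of
$\boldsymbol{\rho}_{22}^{-1/2}\boldsymbol{\rho}_{21}\boldsymbol{\rho}_{11}^{-1}\boldsymbol{\rho}_{12}\boldsymbol{\rho}_{22}^{-1/2}$).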


Comment. Notice that

$$
\begin{aligned}
\mathbf{a}_k'(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})
&= a_{k1}(X_1^{(1)} - \mu_1^{(1)}) + a_{k2}(X_2^{(1)} - \mu_2^{(1)}) + \cdots + a_{kp}(X_p^{(1)} - \mu_p^{(1)}) \\
&= a_{k1}\sqrt{\sigma_{11}}\,\frac{(X_1^{(1)} - \mu_1^{(1)})}{\sqrt{\sigma_{11}}}
 + a_{k2}\sqrt{\sigma_{22}}\,\frac{(X_2^{(1)} - \mu_2^{(1)})}{\sqrt{\sigma_{22}}}
 + \cdots
 + a_{kp}\sqrt{\sigma_{pp}}\,\frac{(X_p^{(1)} - \mu_p^{(1)})}{\sqrt{\sigma_{pp}}}
\end{aligned}
$$

where $\mathrm{Var}(X_i^{(1)}) = \sigma_{ii}$, $i = 1, 2, \ldots, p$. Therefore, the canonical coefficients for the
standardized variables, $Z_i^{(1)} = (X_i^{(1)} - \mu_i^{(1)})/\sqrt{\sigma_{ii}}$, are simply related to the
canonical coefficients attached to the original variables $X_i^{(1)}$. Specifically, if $\mathbf{a}_k'$ is the
coefficient vector for the kth canonical variate $U_k$, then $\mathbf{a}_k'\mathbf{V}_{11}^{1/2}$ is the coefficient vector for
the kth canonical variate constructed from the standardized variables $\mathbf{Z}^{(1)}$. Here $\mathbf{V}_{11}^{1/2}$
is the diagonal matrix with ith diagonal element $\sqrt{\sigma_{ii}}$. Similarly, $\mathbf{b}_k'\mathbf{V}_{22}^{1/2}$ is the
coefficient vector for the canonical variate constructed from the set of standardized
variables $\mathbf{Z}^{(2)}$. In this case $\mathbf{V}_{22}^{1/2}$ is the diagonal matrix with ith diagonal element
$\sqrt{\sigma_{ii}} = \sqrt{\mathrm{Var}(X_i^{(2)})}$. The canonical correlations are unchanged by the standardization.
However, the choice of the coefficient vectors $\mathbf{a}_k$, $\mathbf{b}_k$ will not be unique if $\rho_k^{*2} = \rho_{k+1}^{*2}$.

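The relationship between the canonical coefficients of the standardized
variables and the canonical coefficients of the original variables follows from the special
structure of the matrix [see also (10-11)]

$$
\boldsymbol{\Sigma}_{11}^{-1/2}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1/2}
\quad\text{or}\quad
\boldsymbol{\rho}_{11}^{-1/2}\boldsymbol{\rho}_{12}\boldsymbol{\rho}_{22}^{-1}\boldsymbol{\rho}_{21}\boldsymbol{\rho}_{11}^{-1/2}
$$

and, in this book, is unique to canonical correlation analysis. For example, in
principal component analysis, if $\mathbf{a}_k'$ is the coefficient vector for the kth principal
component obtained from $\boldsymbol{\Sigma}$, then $\mathbf{a}_k'(\mathbf{X} - \boldsymbol{\mu}) = \mathbf{a}_k'\mathbf{V}^{1/2}\mathbf{Z}$, but we cannot infer that $\mathbf{a}_k'\mathbf{V}^{1/2}$
is the coefficient vector for the kth principal component derived from $\boldsymbol{\rho}$.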
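
The effect of standardization described in the Comment can be checked numerically. The sketch below is an illustration under stated assumptions: the correlation blocks and the standard deviations `sd1`, `sd2` are hypothetical values chosen for the check, and `inv_sqrt` and `first_pair` are helper names introduced here rather than routines from the text.

```python
import numpy as np

def inv_sqrt(m):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def first_pair(s11, s12, s22):
    """First canonical correlation and the coefficient vector a_1."""
    s11_is = inv_sqrt(s11)
    m = s11_is @ s12 @ np.linalg.inv(s22) @ s12.T @ s11_is
    vals, vecs = np.linalg.eigh(m)       # eigenvalues in ascending order
    a = s11_is @ vecs[:, -1]             # a_1 = Sigma11^{-1/2} e_1
    if a[np.argmax(np.abs(a))] < 0:      # eigenvector sign is arbitrary
        a = -a
    return np.sqrt(vals[-1]), a

# Hypothetical correlation blocks and standard deviations.
rho11 = np.array([[1.0, 0.4], [0.4, 1.0]])
rho22 = np.array([[1.0, 0.2], [0.2, 1.0]])
rho12 = np.array([[0.5, 0.6], [0.3, 0.4]])
sd1 = np.array([2.0, 5.0])
sd2 = np.array([0.5, 10.0])

v11_half = np.diag(sd1)                  # V11^{1/2}
v22_half = np.diag(sd2)                  # V22^{1/2}
s11 = v11_half @ rho11 @ v11_half
s22 = v22_half @ rho22 @ v22_half
s12 = v11_half @ rho12 @ v22_half

r_x, a_x = first_pair(s11, s12, s22)         # original variables
r_z, a_z = first_pair(rho11, rho12, rho22)   # standardized variables

print(np.round([r_x, r_z], 4))
print(np.round(v11_half @ a_x, 4), np.round(a_z, 4))
```

Under these assumptions the two canonical correlations coincide, and multiplying the original coefficients by the standard deviations reproduces the standardized coefficients.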


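Example 10.1 (Calculating canonical variates and canonical correlations for
standardized variables) Suppose $\mathbf{Z}^{(1)} = [Z_1^{(1)}, Z_2^{(1)}]'$ are standardized variables and
$\mathbf{Z}^{(2)} = [Z_1^{(2)}, Z_2^{(2)}]'$ are also standardized variables. Let $\mathbf{Z} = [\mathbf{Z}^{(1)}, \mathbf{Z}^{(2)}]'$ and

$$
\mathrm{Cov}(\mathbf{Z}) =
\begin{bmatrix} \boldsymbol{\rho}_{11} & \boldsymbol{\rho}_{12} \\ \boldsymbol{\rho}_{21} & \boldsymbol{\rho}_{22} \end{bmatrix} =
\left[\begin{array}{cc|cc}
1.0 & .4 & .5 & .6 \\
.4 & 1.0 & .3 & .4 \\ \hline
.5 & .3 & 1.0 & .2 \\
.6 & .4 & .2 & 1.0
\end{array}\right]
$$

Then

$$
\boldsymbol{\rho}_{11}^{-1/2} = \begin{bmatrix} 1.0681 & -.2229 \\ -.2229 & 1.0681 \end{bmatrix}
\qquad
\boldsymbol{\rho}_{22}^{-1} = \begin{bmatrix} 1.0417 & -.2083 \\ -.2083 & 1.0417 \end{bmatrix}
$$

and

$$
\boldsymbol{\rho}_{11}^{-1/2}\boldsymbol{\rho}_{12}\boldsymbol{\rho}_{22}^{-1}\boldsymbol{\rho}_{21}\boldsymbol{\rho}_{11}^{-1/2} =
\begin{bmatrix} .4371 & .2178 \\ .2178 & .1096 \end{bmatrix}
$$

The eigenvalues, $\rho_1^{*2}$, $\rho_2^{*2}$, of $\boldsymbol{\rho}_{11}^{-1/2}\boldsymbol{\rho}_{12}\boldsymbol{\rho}_{22}^{-1}\boldsymbol{\rho}_{21}\boldsymbol{\rho}_{11}^{-1/2}$ are obtained from

$$
\begin{aligned}
0 = \begin{vmatrix} .4371 - \lambda & .2178 \\ .2178 & .1096 - \lambda \end{vmatrix}
&= (.4371 - \lambda)(.1096 - \lambda) - (.2178)^2 \\
&= \lambda^2 - .5467\lambda + .0005
\end{aligned}
$$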




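yielding $\rho_1^{*2} = .5458$ and $\rho_2^{*2} = .0009$. The eigenvector $\mathbf{e}_1$ follows from the vector
equation

$$
\begin{bmatrix} .4371 & .2178 \\ .2178 & .1096 \end{bmatrix}\mathbf{e}_1 = (.5458)\,\mathbf{e}_1
$$

Thus, $\mathbf{e}_1' = [.8947, .4466]$ and

$$
\mathbf{a}_1 = \boldsymbol{\rho}_{11}^{-1/2}\mathbf{e}_1 = \begin{bmatrix} .8561 \\ .2776 \end{bmatrix}
$$

From Result 10.1, $\mathbf{f}_1 \propto \boldsymbol{\rho}_{22}^{-1/2}\boldsymbol{\rho}_{21}\boldsymbol{\rho}_{11}^{-1/2}\mathbf{e}_1$ and $\mathbf{b}_1 = \boldsymbol{\rho}_{22}^{-1/2}\mathbf{f}_1$. Consequently,

$$
\mathbf{b}_1 \propto \boldsymbol{\rho}_{22}^{-1}\boldsymbol{\rho}_{21}\mathbf{a}_1 =
\begin{bmatrix} .3959 & .2292 \\ .5209 & .3542 \end{bmatrix}
\begin{bmatrix} .8561 \\ .2776 \end{bmatrix} =
\begin{bmatrix} .4026 \\ .5443 \end{bmatrix}
$$

We must scale $\mathbf{b}_1$ so that

$$
\mathrm{Var}(V_1) = \mathrm{Var}(\mathbf{b}_1'\mathbf{Z}^{(2)}) = \mathbf{b}_1'\boldsymbol{\rho}_{22}\mathbf{b}_1 = 1
$$

The vector $[.4026, .5443]'$ gives

$$
[.4026, .5443]
\begin{bmatrix} 1.0 & .2 \\ .2 & 1.0 \end{bmatrix}
\begin{bmatrix} .4026 \\ .5443 \end{bmatrix} = .5460
$$

Using $\sqrt{.5460} = .7389$, we take

$$
\mathbf{b}_1 = \frac{1}{.7389}\begin{bmatrix} .4026 \\ .5443 \end{bmatrix} = \begin{bmatrix} .5448 \\ .7366 \end{bmatrix}
$$

The first pair of canonical variates is

$$
\begin{aligned}
U_1 &= \mathbf{a}_1'\mathbf{Z}^{(1)} = .86Z_1^{(1)} + .28Z_2^{(1)} \\
V_1 &= \mathbf{b}_1'\mathbf{Z}^{(2)} = .54Z_1^{(2)} + .74Z_2^{(2)}
\end{aligned}
$$

and their canonical correlation is

$$
\rho_1^* = \sqrt{\rho_1^{*2}} = \sqrt{.5458} = .74
$$

This is the largest correlation possible between linear combinations of variables
from the $\mathbf{Z}^{(1)}$ and $\mathbf{Z}^{(2)}$ sets.

The second canonical correlation, $\rho_2^* = \sqrt{.0009} = .03$, is very small, and
consequently, the second pair of canonical variates, although uncorrelated with members of
the first pair, conveys very little information about the association between sets. (The
calculation of the second pair of canonical variates is considered in Exercise 10.5.)

We note that $U_1$ and $V_1$, apart from a scale change, are not much different from
the pair

$$
\begin{aligned}
\tilde{U}_1 &= \mathbf{a}'\mathbf{Z}^{(1)} = [3, 1]\begin{bmatrix} Z_1^{(1)} \\ Z_2^{(1)} \end{bmatrix} = 3Z_1^{(1)} + Z_2^{(1)} \\
\tilde{V}_1 &= \mathbf{b}'\mathbf{Z}^{(2)} = [1, 1]\begin{bmatrix} Z_1^{(2)} \\ Z_2^{(2)} \end{bmatrix} = Z_1^{(2)} + Z_2^{(2)}
\end{aligned}
$$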




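For these variates,

$$
\begin{aligned}
\mathrm{Var}(\tilde{U}_1) &= \mathbf{a}'\boldsymbol{\rho}_{11}\mathbf{a} = 12.4 \\
\mathrm{Var}(\tilde{V}_1) &= \mathbf{b}'\boldsymbol{\rho}_{22}\mathbf{b} = 2.4 \\
\mathrm{Cov}(\tilde{U}_1, \tilde{V}_1) &= \mathbf{a}'\boldsymbol{\rho}_{12}\mathbf{b} = 4.0
\end{aligned}
$$

and

$$
\mathrm{Corr}(\tilde{U}_1, \tilde{V}_1) = \frac{4.0}{\sqrt{12.4}\,\sqrt{2.4}} = .73
$$

The correlation between the rather simple and, perhaps, easily interpretable linear
combinations $\tilde{U}_1$, $\tilde{V}_1$ is almost the maximum value $\rho_1^* = .74$. ■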
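
The steps of Example 10.1 can be retraced numerically. What follows is a minimal NumPy sketch of the recipe in Result 10.1 applied to the correlation blocks of the example; the helper names `inv_sqrt` and `canonical_analysis` are introduced here for illustration, and the sketch assumes every canonical correlation is nonzero so that each $\mathbf{b}_k$ can be rescaled.

```python
import numpy as np

def inv_sqrt(m):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def canonical_analysis(s11, s12, s22):
    """Canonical correlations and coefficient matrices (rows a_k', b_k')."""
    s11_is = inv_sqrt(s11)
    s22_inv = np.linalg.inv(s22)
    m = s11_is @ s12 @ s22_inv @ s12.T @ s11_is    # symmetric p x p matrix
    vals, e = np.linalg.eigh(m)                     # ascending eigenvalues
    vals, e = vals[::-1], e[:, ::-1]                # largest first
    # Eigenvectors are defined up to sign; make the largest entry positive.
    cols = np.arange(e.shape[1])
    e = e * np.sign(e[np.argmax(np.abs(e), axis=0), cols])
    a = (s11_is @ e).T                              # a_k = S11^{-1/2} e_k
    b = (s22_inv @ s12.T @ a.T).T                   # b_k proportional to S22^{-1} S21 a_k
    b = b / np.sqrt(np.einsum("ki,ij,kj->k", b, s22, b))[:, None]  # Var(V_k) = 1
    return np.sqrt(np.clip(vals, 0.0, None)), a, b

rho11 = np.array([[1.0, 0.4], [0.4, 1.0]])
rho22 = np.array([[1.0, 0.2], [0.2, 1.0]])
rho12 = np.array([[0.5, 0.6], [0.3, 0.4]])

rho, a, b = canonical_analysis(rho11, rho12, rho22)
print(np.round(rho, 2))    # canonical correlations
print(np.round(a[0], 2))   # coefficients of U_1
print(np.round(b[0], 2))   # coefficients of V_1
```

Because the computation carries full precision rather than four-decimal intermediate values, small differences from the hand calculation in the last reported digit are to be expected.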


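The procedure for obtaining the canonical variates presented in Result 10.1 has
certain advantages. The symmetric matrices, whose eigenvectors determine the
canonical coefficients, are readily handled by computer routines. Moreover, writing
the coefficient vectors as $\mathbf{a}_k = \boldsymbol{\Sigma}_{11}^{-1/2}\mathbf{e}_k$ and $\mathbf{b}_k = \boldsymbol{\Sigma}_{22}^{-1/2}\mathbf{f}_k$ facilitates analytic
descriptions and their geometric interpretations. To ease the computational burden, many
people prefer to get the canonical correlations from the eigenvalue equation

$$
\left|\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} - \rho^{*2}\mathbf{I}\right| = 0
\tag{10-10}
$$

The coefficient vectors $\mathbf{a}$ and $\mathbf{b}$ follow directly from the eigenvector equations

$$
\begin{aligned}
\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\mathbf{a} &= \rho^{*2}\mathbf{a} \\
\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\mathbf{b} &= \rho^{*2}\mathbf{b}
\end{aligned}
\tag{10-11}
$$

The matrices $\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}$ and $\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}$ are, in general, not symmetric. (See
Exercise 10.4 for more details.)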
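
The alternative route through (10-10) and (10-11) can be sketched with a general (nonsymmetric) eigenvalue routine. The example below reuses the correlation blocks of Example 10.1 in place of the covariance blocks; the variable names are illustrative, and `np.linalg.eig` is used because the matrices involved are not symmetric.

```python
import numpy as np

rho11 = np.array([[1.0, 0.4], [0.4, 1.0]])
rho22 = np.array([[1.0, 0.2], [0.2, 1.0]])
rho12 = np.array([[0.5, 0.6], [0.3, 0.4]])
rho21 = rho12.T

# Matrices appearing in (10-10) and (10-11).
m_a = np.linalg.solve(rho11, rho12) @ np.linalg.solve(rho22, rho21)
m_b = np.linalg.solve(rho22, rho21) @ np.linalg.solve(rho11, rho12)

vals_a, vecs_a = np.linalg.eig(m_a)
vals_b, vecs_b = np.linalg.eig(m_b)

# The eigenvalues are the squared canonical correlations (real here,
# so any zero imaginary part is dropped).
canon_corr = np.sqrt(np.sort(vals_a.real)[::-1])

# eig returns unit-length eigenvectors; rescale them to unit variance.
a1 = vecs_a[:, np.argmax(vals_a.real)].real
a1 = a1 / np.sqrt(a1 @ rho11 @ a1)
b1 = vecs_b[:, np.argmax(vals_b.real)].real
b1 = b1 / np.sqrt(b1 @ rho22 @ b1)

# Eigenvector signs are arbitrary; choose them so Corr(U_1, V_1) > 0.
if a1 @ rho12 @ b1 < 0:
    b1 = -b1

print(np.round(canon_corr, 2))
```

This route avoids matrix square roots at the cost of working with nonsymmetric matrices, for which eigenvalue routines offer fewer guarantees than their symmetric counterparts.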


10.3 Interpreting the Population Canonical Variables 


Canonical variables are, in general, artificial. That is, they have no physical meaning. 
If the original variables $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ are used, the canonical coefficients $\mathbf{a}$ and $\mathbf{b}$
have units proportional to those of the $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ sets. If the original variables
are standardized to have zero means and unit variances, the canonical coefficients 
have no units of measurement, and they must be interpreted in terms of the stan- 
dardized variables. 

Result 10.1 gives the technical definitions of the canonical variables and canon- 
ical correlations. In this section, we concentrate on interpreting these quantities. 


Identifying the Canonical Variables 


Even though the canonical variables are artificial, they can often be “identified” 
in terms of the subject-matter variables. Many times this identification is aided 
by computing the correlations between the canonical variates and the original 
variables. These correlations, however, must be interpreted with caution. They 
provide only univariate information, in the sense that they do not indicate how the 
original variables contribute jointly to the canonical analyses. (See, for example, [11].) 




For this reason, many investigators prefer to assess the contributions of the original
variables directly from the standardized coefficients (10-8).


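Let $\underset{(p\times p)}{\mathbf{A}} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p]'$ and $\underset{(q\times q)}{\mathbf{B}} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_q]'$, so that the vectors of
canonical variables are

$$
\underset{(p\times 1)}{\mathbf{U}} = \mathbf{A}\mathbf{X}^{(1)} \qquad \underset{(q\times 1)}{\mathbf{V}} = \mathbf{B}\mathbf{X}^{(2)}
\tag{10-12}
$$

where we are primarily interested in the first p canonical variables in $\mathbf{V}$. Then

$$
\mathrm{Cov}(\mathbf{U}, \mathbf{X}^{(1)}) = \mathrm{Cov}(\mathbf{A}\mathbf{X}^{(1)}, \mathbf{X}^{(1)}) = \mathbf{A}\boldsymbol{\Sigma}_{11}
\tag{10-13}
$$

Because $\mathrm{Var}(U_i) = 1$, $\mathrm{Corr}(U_i, X_k^{(1)})$ is obtained by dividing $\mathrm{Cov}(U_i, X_k^{(1)})$ by
$\sqrt{\mathrm{Var}(X_k^{(1)})} = \sigma_{kk}^{1/2}$. Equivalently, $\mathrm{Corr}(U_i, X_k^{(1)}) = \mathrm{Cov}(U_i, \sigma_{kk}^{-1/2}X_k^{(1)})$.
Introducing the $(p \times p)$ diagonal matrix $\mathbf{V}_{11}^{-1/2}$ with kth diagonal element $\sigma_{kk}^{-1/2}$,
we have, in matrix terms,

$$
\underset{(p\times p)}{\boldsymbol{\rho}_{\mathbf{U},\mathbf{X}^{(1)}}} = \mathrm{Corr}(\mathbf{U}, \mathbf{X}^{(1)}) = \mathrm{Cov}(\mathbf{U}, \mathbf{V}_{11}^{-1/2}\mathbf{X}^{(1)}) = \mathrm{Cov}(\mathbf{A}\mathbf{X}^{(1)}, \mathbf{V}_{11}^{-1/2}\mathbf{X}^{(1)}) = \mathbf{A}\boldsymbol{\Sigma}_{11}\mathbf{V}_{11}^{-1/2}
$$

Similar calculations for the pairs $(\mathbf{U}, \mathbf{X}^{(2)})$, $(\mathbf{V}, \mathbf{X}^{(2)})$ and $(\mathbf{V}, \mathbf{X}^{(1)})$ yield

$$
\begin{aligned}
\underset{(p\times p)}{\boldsymbol{\rho}_{\mathbf{U},\mathbf{X}^{(1)}}} &= \mathbf{A}\boldsymbol{\Sigma}_{11}\mathbf{V}_{11}^{-1/2} &
\underset{(q\times q)}{\boldsymbol{\rho}_{\mathbf{V},\mathbf{X}^{(2)}}} &= \mathbf{B}\boldsymbol{\Sigma}_{22}\mathbf{V}_{22}^{-1/2} \\
\underset{(p\times q)}{\boldsymbol{\rho}_{\mathbf{U},\mathbf{X}^{(2)}}} &= \mathbf{A}\boldsymbol{\Sigma}_{12}\mathbf{V}_{22}^{-1/2} &
\underset{(q\times p)}{\boldsymbol{\rho}_{\mathbf{V},\mathbf{X}^{(1)}}} &= \mathbf{B}\boldsymbol{\Sigma}_{21}\mathbf{V}_{11}^{-1/2}
\end{aligned}
\tag{10-14}
$$

where $\mathbf{V}_{22}^{-1/2}$ is the $(q \times q)$ diagonal matrix with ith diagonal element $[\mathrm{Var}(X_i^{(2)})]^{-1/2}$.
Canonical variables derived from standardized variables are sometimes
interpreted by computing the correlations. Thus,

$$
\begin{aligned}
\boldsymbol{\rho}_{\mathbf{U},\mathbf{Z}^{(1)}} &= \mathbf{A}_z\boldsymbol{\rho}_{11} & \boldsymbol{\rho}_{\mathbf{V},\mathbf{Z}^{(2)}} &= \mathbf{B}_z\boldsymbol{\rho}_{22} \\
\boldsymbol{\rho}_{\mathbf{U},\mathbf{Z}^{(2)}} &= \mathbf{A}_z\boldsymbol{\rho}_{12} & \boldsymbol{\rho}_{\mathbf{V},\mathbf{Z}^{(1)}} &= \mathbf{B}_z\boldsymbol{\rho}_{21}
\end{aligned}
\tag{10-15}
$$

where $\underset{(p\times p)}{\mathbf{A}_z}$ and $\underset{(q\times q)}{\mathbf{B}_z}$ are the matrices whose rows contain the canonical coefficients
for the $\mathbf{Z}^{(1)}$ and $\mathbf{Z}^{(2)}$ sets, respectively. The correlations in the matrices displayed
in (10-15) have the same numerical values as those appearing in (10-14); that is,
$\boldsymbol{\rho}_{\mathbf{U},\mathbf{X}^{(1)}} = \boldsymbol{\rho}_{\mathbf{U},\mathbf{Z}^{(1)}}$, and so forth. This follows because, for example,
$\boldsymbol{\rho}_{\mathbf{U},\mathbf{X}^{(1)}} = \mathbf{A}\boldsymbol{\Sigma}_{11}\mathbf{V}_{11}^{-1/2} = \mathbf{A}\mathbf{V}_{11}^{1/2}\mathbf{V}_{11}^{-1/2}\boldsymbol{\Sigma}_{11}\mathbf{V}_{11}^{-1/2} = \mathbf{A}_z\boldsymbol{\rho}_{11} = \boldsymbol{\rho}_{\mathbf{U},\mathbf{Z}^{(1)}}$.
The correlations are unaffected by the standardization.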


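Example 10.2 (Computing correlations between canonical variates and their
component variables) Compute the correlations between the first pair of canonical variates
and their component variables for the situation considered in Example 10.1.

The variables in Example 10.1 are already standardized, so equation (10-15) is
applicable. For the standardized variables,

$$
\boldsymbol{\rho}_{11} = \begin{bmatrix} 1.0 & .4 \\ .4 & 1.0 \end{bmatrix}
\qquad
\boldsymbol{\rho}_{22} = \begin{bmatrix} 1.0 & .2 \\ .2 & 1.0 \end{bmatrix}
$$

and

$$
\boldsymbol{\rho}_{12} = \begin{bmatrix} .5 & .6 \\ .3 & .4 \end{bmatrix}
$$

With p = 1,

$$
\mathbf{A}_z = [.86, .28] \qquad \mathbf{B}_z = [.54, .74]
$$

so

$$
\boldsymbol{\rho}_{U_1,\mathbf{Z}^{(1)}} = \mathbf{A}_z\boldsymbol{\rho}_{11} = [.86, .28]\begin{bmatrix} 1.0 & .4 \\ .4 & 1.0 \end{bmatrix} = [.97, .62]
$$

and

$$
\boldsymbol{\rho}_{V_1,\mathbf{Z}^{(2)}} = \mathbf{B}_z\boldsymbol{\rho}_{22} = [.54, .74]\begin{bmatrix} 1.0 & .2 \\ .2 & 1.0 \end{bmatrix} = [.69, .85]
$$

We conclude that, of the two variables in the set $\mathbf{Z}^{(1)}$, the first is most closely
associated with the canonical variate $U_1$. Of the two variables in the set $\mathbf{Z}^{(2)}$,
the second is most closely associated with $V_1$. In this case, the correlations reinforce
the information supplied by the standardized coefficients $\mathbf{A}_z$ and $\mathbf{B}_z$. However, the
correlations elevate the relative importance of $Z_2^{(1)}$ in the first set and $Z_1^{(2)}$ in
the second set because they ignore the contribution of the remaining variable
in each set.

From (10-15), we also obtain the correlations

$$
\boldsymbol{\rho}_{U_1,\mathbf{Z}^{(2)}} = \mathbf{A}_z\boldsymbol{\rho}_{12} = [.86, .28]\begin{bmatrix} .5 & .6 \\ .3 & .4 \end{bmatrix} = [.51, .63]
$$

and

$$
\boldsymbol{\rho}_{V_1,\mathbf{Z}^{(1)}} = \mathbf{B}_z\boldsymbol{\rho}_{21} = \mathbf{B}_z\boldsymbol{\rho}_{12}' = [.54, .74]\begin{bmatrix} .5 & .3 \\ .6 & .4 \end{bmatrix} = [.71, .46]
$$

Later, in our discussion of the sample canonical variates, we shall comment on
the interpretation of these last correlations. ■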
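
The four products in Example 10.2 are plain vector-matrix multiplications, as the following NumPy sketch shows. It uses the rounded first-pair coefficients quoted in the example; the variable names are illustrative.

```python
import numpy as np

rho11 = np.array([[1.0, 0.4], [0.4, 1.0]])
rho22 = np.array([[1.0, 0.2], [0.2, 1.0]])
rho12 = np.array([[0.5, 0.6], [0.3, 0.4]])

# Coefficients of the first canonical pair, rounded as in the example.
a_z = np.array([0.86, 0.28])
b_z = np.array([0.54, 0.74])

# Correlations (10-15) between each canonical variate and each variable.
corr_u1_z1 = a_z @ rho11      # U_1 with Z^(1)
corr_v1_z2 = b_z @ rho22      # V_1 with Z^(2)
corr_u1_z2 = a_z @ rho12      # U_1 with Z^(2)
corr_v1_z1 = b_z @ rho12.T    # V_1 with Z^(1)

for name, value in [("U1,Z1", corr_u1_z1), ("V1,Z2", corr_v1_z2),
                    ("U1,Z2", corr_u1_z2), ("V1,Z1", corr_v1_z1)]:
    print(name, np.round(value, 2))
```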


The correlations $\boldsymbol{\rho}_{\mathbf{U},\mathbf{X}^{(1)}}$ and $\boldsymbol{\rho}_{\mathbf{V},\mathbf{X}^{(2)}}$ can help supply meanings for the canonical
variates. The spirit is the same as in principal component analysis when the
correlations between the principal components and their associated variables may provide
subject-matter interpretations for the components.


Canonical Correlations as Generalizations 
of Other Correlation Coefficients 


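First, the canonical correlation generalizes the correlation between two variables.
When $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ each consist of a single variable, so that $p = q = 1$,

$$
\left|\mathrm{Corr}(X_1^{(1)}, X_1^{(2)})\right| = \left|\mathrm{Corr}(aX_1^{(1)}, bX_1^{(2)})\right| \quad \text{for all } a, b \ne 0
$$

Therefore, the "canonical variates" $U_1 = X_1^{(1)}$ and $V_1 = X_1^{(2)}$ have correlation
$\rho_1^* = \left|\mathrm{Corr}(X_1^{(1)}, X_1^{(2)})\right|$. When $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ have more components, setting
$\mathbf{a}' = [0, \ldots, 0, 1, 0, \ldots, 0]$ with 1 in the ith position and $\mathbf{b}' = [0, \ldots, 0, 1, 0, \ldots, 0]$
with 1 in the kth position yields

$$
\left|\mathrm{Corr}(X_i^{(1)}, X_k^{(2)})\right| = \left|\mathrm{Corr}(\mathbf{a}'\mathbf{X}^{(1)}, \mathbf{b}'\mathbf{X}^{(2)})\right|
\le \max_{\mathbf{a},\mathbf{b}} \mathrm{Corr}(\mathbf{a}'\mathbf{X}^{(1)}, \mathbf{b}'\mathbf{X}^{(2)}) = \rho_1^*
\tag{10-16}
$$

That is, the first canonical correlation is at least as large as the absolute value of any entry
in $\boldsymbol{\rho}_{12} = \mathbf{V}_{11}^{-1/2}\boldsymbol{\Sigma}_{12}\mathbf{V}_{22}^{-1/2}$.

Second, the multiple correlation coefficient $\rho_{1(\mathbf{X}^{(2)})}$ [see (7-48)] is a special case
of a canonical correlation when $\mathbf{X}^{(1)}$ has the single element $X_1^{(1)}$ $(p = 1)$. Recall that

$$
\rho_{1(\mathbf{X}^{(2)})} = \max_{\mathbf{b}} \mathrm{Corr}(X_1^{(1)}, \mathbf{b}'\mathbf{X}^{(2)}) = \rho_1^* \quad \text{for } p = 1
\tag{10-17}
$$

When $p > 1$, $\rho_1^*$ is at least as large as each of the multiple correlations of $X_i^{(1)}$ with $\mathbf{X}^{(2)}$ or
the multiple correlations of $X_i^{(2)}$ with $\mathbf{X}^{(1)}$.
Finally, we note that

$$
\rho_{U_k(\mathbf{X}^{(2)})} = \max_{\mathbf{b}} \mathrm{Corr}(U_k, \mathbf{b}'\mathbf{X}^{(2)}) = \mathrm{Corr}(U_k, V_k) = \rho_k^*, \quad k = 1, 2, \ldots, p
\tag{10-18}
$$

from the proof of Result 10.1 (see website: www.prenhall.com/statistics). Similarly,

$$
\rho_{V_k(\mathbf{X}^{(1)})} = \max_{\mathbf{a}} \mathrm{Corr}(\mathbf{a}'\mathbf{X}^{(1)}, V_k) = \mathrm{Corr}(U_k, V_k) = \rho_k^*, \quad k = 1, 2, \ldots, p
\tag{10-19}
$$

That is, the canonical correlations are also the multiple correlation coefficients of $U_k$
with $\mathbf{X}^{(2)}$ or the multiple correlation coefficients of $V_k$ with $\mathbf{X}^{(1)}$.

Because of its multiple correlation coefficient interpretation, the kth squared
canonical correlation $\rho_k^{*2}$ is the proportion of the variance of canonical variate $U_k$
"explained" by the set $\mathbf{X}^{(2)}$. It is also the proportion of the variance of canonical
variate $V_k$ "explained" by the set $\mathbf{X}^{(1)}$. Therefore, $\rho_k^{*2}$ is often called the shared
variance between the two sets $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$. The largest value, $\rho_1^{*2}$, is sometimes
regarded as a measure of set "overlap."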
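
The bound (10-16) and the special case (10-17) can be checked numerically. The sketch below is an illustration under stated assumptions: it uses hypothetical correlation blocks (the same values as in Example 10.1), a helper `first_canonical_corr` introduced here, and the standard formula for a population multiple correlation, $\sqrt{\boldsymbol{\sigma}'\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\sigma}/\sigma_{11}}$.

```python
import numpy as np

def first_canonical_corr(s11, s12, s22):
    """Largest canonical correlation from the eigenvalue equation (10-10)."""
    m = np.linalg.solve(s11, s12) @ np.linalg.solve(s22, s12.T)
    return np.sqrt(np.max(np.linalg.eigvals(m).real))

# Hypothetical correlation blocks for two sets of two variables.
rho11 = np.array([[1.0, 0.4], [0.4, 1.0]])
rho22 = np.array([[1.0, 0.2], [0.2, 1.0]])
rho12 = np.array([[0.5, 0.6], [0.3, 0.4]])

rho1 = first_canonical_corr(rho11, rho12, rho22)

# (10-16): rho1 bounds every individual cross-correlation in rho12.
max_cross = np.abs(rho12).max()

# (10-17): with a single variable in the first set, the canonical
# correlation equals that variable's multiple correlation with set 2.
pairs = []
for i in range(rho11.shape[0]):
    s12_i = rho12[i:i + 1, :]                        # 1 x q block
    r_canon = first_canonical_corr(rho11[i:i + 1, i:i + 1], s12_i, rho22)
    r_mult = np.sqrt(s12_i @ np.linalg.solve(rho22, s12_i.T) / rho11[i, i]).item()
    pairs.append((r_canon, r_mult))

print(round(rho1, 4), max_cross, np.round(pairs, 4))
```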


The First r Canonical Variables as a Summary of Variability 


The change of coordinates from $\mathbf{X}^{(1)}$ to $\mathbf{U} = \mathbf{A}\mathbf{X}^{(1)}$ and from $\mathbf{X}^{(2)}$ to $\mathbf{V} = \mathbf{B}\mathbf{X}^{(2)}$ is
chosen to maximize $\mathrm{Corr}(U_1, V_1)$ and, successively, $\mathrm{Corr}(U_i, V_i)$, where $(U_i, V_i)$ have
zero correlation with the previous pairs $(U_1, V_1), (U_2, V_2), \ldots, (U_{i-1}, V_{i-1})$.
Correlation between the sets $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ has been isolated in the pairs of canonical
variables.

By design, the coefficient vectors $\mathbf{a}_i$, $\mathbf{b}_i$ are selected to maximize correlations,
not necessarily to provide variables that (approximately) account for the subset
covariances $\boldsymbol{\Sigma}_{11}$ and $\boldsymbol{\Sigma}_{22}$. When the first few pairs of canonical variables provide
poor summaries of the variability in $\boldsymbol{\Sigma}_{11}$ and $\boldsymbol{\Sigma}_{22}$, it is not clear how a high canonical
correlation should be interpreted.




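Example 10.3 (Canonical correlation as a poor summary of variability) Consider the
covariance matrix

$$\operatorname{Cov}\begin{bmatrix} X_1^{(1)} \\ X_2^{(1)} \\ X_1^{(2)} \\ X_2^{(2)} \end{bmatrix} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} = \left[\begin{array}{cc|cc} 100 & 0 & 0 & 0 \\ 0 & 1 & .95 & 0 \\ \hline 0 & .95 & 1 & 0 \\ 0 & 0 & 0 & 100 \end{array}\right]$$

The reader may verify (see Exercise 10.1) that the first pair of canonical variates
$U_1 = X_2^{(1)}$ and $V_1 = X_1^{(2)}$ has correlation

$$\rho_1^* = \operatorname{Corr}(U_1, V_1) = .95$$

Yet $U_1 = X_2^{(1)}$ provides a very poor summary of the variability in the first set. Most
of the variability in this set is in $X_1^{(1)}$, which is uncorrelated with $U_1$. The same
situation is true for $V_1 = X_1^{(2)}$ in the second set. ∎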
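The claim in Example 10.3 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper `inv_sqrt` and the variable names are illustrative choices, not part of the text. It computes the largest eigenvalue of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}$ and the share of the first set's total variance carried by the variable on which $U_1$ loads.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root S^{-1/2} from the spectral decomposition."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

# Covariance matrix of Example 10.3, partitioned with p = q = 2.
Sigma = np.array([[100.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.95, 0.0],
                  [0.0, 0.95, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 100.0]])
S11, S12, S22 = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]

# Squared canonical correlations are the eigenvalues of
# S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} (Result 10.1).
W1 = inv_sqrt(S11)
M = W1 @ S12 @ np.linalg.inv(S22) @ S12.T @ W1
vals, vecs = np.linalg.eigh(M)      # eigenvalues in ascending order
rho1 = np.sqrt(vals[-1])            # first canonical correlation
a1 = W1 @ vecs[:, -1]               # coefficient vector of U1 (sign is arbitrary)

# Fraction of the first set's total variance that sits in X2^(1), the variable U1 uses.
share = S11[1, 1] / np.trace(S11)
print(rho1, a1, share)
```

The coefficient vector loads only on $X_2^{(1)}$, which carries about 1% of the total variance of the first set, even though the canonical correlation is .95.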


A Geometrical Interpretation of the Population Canonical 
Correlation Analysis 
A geometrical interpretation of the procedure for selecting canonical variables
provides some valuable insights into the nature of a canonical correlation analysis.
The transformation

$$U = AX^{(1)}$$

from $X^{(1)}$ to $U$ gives

$$\operatorname{Cov}(U) = A\Sigma_{11}A' = I$$

From Result 10.1 and (2-22), $A = E'\Sigma_{11}^{-1/2} = E'P_1\Lambda_1^{-1/2}P_1'$, where $E'$ is an orthogonal
matrix with rows $e_i'$, and $\Sigma_{11} = P_1\Lambda_1P_1'$. Now, $P_1'X^{(1)}$ is the set of principal
components derived from $X^{(1)}$ alone. The matrix $\Lambda_1^{-1/2}P_1'X^{(1)}$ has ith row $(1/\sqrt{\lambda_i})\,p_i'X^{(1)}$,
which is the ith principal component scaled to have unit variance. That is,

$$\operatorname{Cov}(\Lambda_1^{-1/2}P_1'X^{(1)}) = \Lambda_1^{-1/2}P_1'\Sigma_{11}P_1\Lambda_1^{-1/2} = \Lambda_1^{-1/2}P_1'P_1\Lambda_1P_1'P_1\Lambda_1^{-1/2} = \Lambda_1^{-1/2}\Lambda_1\Lambda_1^{-1/2} = I$$

Consequently, $U = AX^{(1)} = E'P_1\Lambda_1^{-1/2}P_1'X^{(1)}$ can be interpreted as (1) a
transformation of $X^{(1)}$ to uncorrelated standardized principal components,
followed by (2) a rigid (orthogonal) rotation $P_1$ determined by $\Sigma_{11}$ and then (3)
another rotation $E'$ determined from the full covariance matrix $\Sigma$. A similar
interpretation applies to $V = BX^{(2)}$.
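The three-step reading of $A = E'P_1\Lambda_1^{-1/2}P_1'$ can be traced numerically. The sketch below assumes NumPy and uses a hypothetical positive definite covariance matrix chosen only for illustration (it does not come from the text). It checks that the standardized principal components already have identity covariance and that the two subsequent rotations preserve it.

```python
import numpy as np

# Hypothetical covariance matrix with p = q = 2 (illustrative values only;
# it is symmetric and diagonally dominant, hence positive definite).
Sigma = np.array([[4.0, 1.0, 1.2, 0.4],
                  [1.0, 2.5, 0.3, 0.8],
                  [1.2, 0.3, 3.0, 0.5],
                  [0.4, 0.8, 0.5, 2.0]])
S11, S12, S22 = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]

lam, P1 = np.linalg.eigh(S11)                # Sigma11 = P1 diag(lam) P1'
step1 = np.diag(lam ** -0.5) @ P1.T          # (1) standardized principal components
W1 = P1 @ step1                              # (2) rotate by P1: this is Sigma11^{-1/2}

M = W1 @ S12 @ np.linalg.inv(S22) @ S12.T @ W1
_, E = np.linalg.eigh(M)                     # columns of E are the eigenvectors e_i
E = E[:, ::-1]                               # largest eigenvalue first
A = E.T @ W1                                 # (3) rotate by E': A = E' P1 Lam^{-1/2} P1'

cov_step1 = step1 @ S11 @ step1.T            # covariance after step (1)
cov_U = A @ S11 @ A.T                        # covariance of U = A X^(1)
print(np.round(cov_step1, 10))
print(np.round(cov_U, 10))
```

Both printed matrices should equal the identity up to rounding error, since orthogonal rotations do not change an identity covariance matrix.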




10.4 The Sample Canonical Variates and Sample 
Canonical Correlations 


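A random sample of n observations on each of the $(p + q)$ variables $X^{(1)}, X^{(2)}$ can
be assembled into the $n \times (p + q)$ data matrix

$$X = [X^{(1)} \mid X^{(2)}] = \left[\begin{array}{cccc|cccc} x_{11}^{(1)} & x_{12}^{(1)} & \cdots & x_{1p}^{(1)} & x_{11}^{(2)} & x_{12}^{(2)} & \cdots & x_{1q}^{(2)} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ x_{n1}^{(1)} & x_{n2}^{(1)} & \cdots & x_{np}^{(1)} & x_{n1}^{(2)} & x_{n2}^{(2)} & \cdots & x_{nq}^{(2)} \end{array}\right] = \begin{bmatrix} x_1^{(1)\prime} & x_1^{(2)\prime} \\ \vdots & \vdots \\ x_n^{(1)\prime} & x_n^{(2)\prime} \end{bmatrix} \qquad (10\text{-}20)$$

The vector of sample means can be organized as

$$\bar{x} = \begin{bmatrix} \bar{x}^{(1)} \\ \bar{x}^{(2)} \end{bmatrix} \quad \text{where} \quad \bar{x}^{(1)} = \frac{1}{n}\sum_{j=1}^{n} x_j^{(1)}, \qquad \bar{x}^{(2)} = \frac{1}{n}\sum_{j=1}^{n} x_j^{(2)} \qquad (10\text{-}21)$$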

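Similarly, the sample covariance matrix can be arranged analogous to the
representation (10-4). Thus,

$$\underset{((p+q)\times(p+q))}{S} = \begin{bmatrix} \underset{(p\times p)}{S_{11}} & \underset{(p\times q)}{S_{12}} \\ \underset{(q\times p)}{S_{21}} & \underset{(q\times q)}{S_{22}} \end{bmatrix}$$

where

$$S_{kl} = \frac{1}{n-1}\sum_{j=1}^{n} (x_j^{(k)} - \bar{x}^{(k)})(x_j^{(l)} - \bar{x}^{(l)})', \quad k, l = 1, 2 \qquad (10\text{-}22)$$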


The linear combinations

$$\hat{U} = \hat{a}'x^{(1)}; \qquad \hat{V} = \hat{b}'x^{(2)} \qquad (10\text{-}23)$$

have sample correlation [see (3-36)]

$$r_{\hat{U},\hat{V}} = \frac{\hat{a}'S_{12}\hat{b}}{\sqrt{\hat{a}'S_{11}\hat{a}}\,\sqrt{\hat{b}'S_{22}\hat{b}}} \qquad (10\text{-}24)$$

The first pair of sample canonical variates is the pair of linear combinations
$\hat{U}_1, \hat{V}_1$ having unit sample variances that maximize the ratio (10-24).

In general, the kth pair of sample canonical variates is the pair of linear combinations
$\hat{U}_k, \hat{V}_k$ having unit sample variances that maximize the ratio (10-24) among those linear
combinations uncorrelated with the previous $k - 1$ sample canonical variates.

The sample correlation between $\hat{U}_k$ and $\hat{V}_k$ is called the kth sample canonical
correlation.

The sample canonical variates and the sample canonical correlations can be
obtained from the sample covariance matrices $S_{11}$, $S_{12} = S_{21}'$, and $S_{22}$ in a manner
consistent with the population case described in Result 10.1.




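Result 10.2. Let $\hat\rho_1^{*2} \ge \hat\rho_2^{*2} \ge \cdots \ge \hat\rho_p^{*2}$ be the p ordered eigenvalues of
$S_{11}^{-1/2}S_{12}S_{22}^{-1}S_{21}S_{11}^{-1/2}$ with corresponding eigenvectors $\hat{e}_1, \hat{e}_2, \ldots, \hat{e}_p$, where the $S_{kl}$ are
defined in (10-22) and $p \le q$. Let $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_p$ be the eigenvectors of
$S_{22}^{-1/2}S_{21}S_{11}^{-1}S_{12}S_{22}^{-1/2}$, where the first p $\hat{f}$'s may be obtained from
$\hat{f}_k = (1/\hat\rho_k^*)\,S_{22}^{-1/2}S_{21}S_{11}^{-1/2}\hat{e}_k$, $k = 1, 2, \ldots, p$. Then the kth sample canonical variate pair¹ is

$$\hat{U}_k = \underbrace{\hat{e}_k'S_{11}^{-1/2}}_{\hat{a}_k'}\,x^{(1)} \qquad \hat{V}_k = \underbrace{\hat{f}_k'S_{22}^{-1/2}}_{\hat{b}_k'}\,x^{(2)}$$

where $x^{(1)}$ and $x^{(2)}$ are the values of the variables $X^{(1)}$ and $X^{(2)}$ for a particular
experimental unit. Also, the first sample canonical variate pair has the maximum
sample correlation

$$r_{\hat{U}_1,\hat{V}_1} = \hat\rho_1^*$$

and for the kth pair,

$$r_{\hat{U}_k,\hat{V}_k} = \hat\rho_k^*$$

is the largest possible correlation among linear combinations uncorrelated with the
preceding $k - 1$ sample canonical variates.

The quantities $\hat\rho_1^*, \hat\rho_2^*, \ldots, \hat\rho_p^*$ are the sample canonical correlations.²

Proof. The proof of this result follows the proof of Result 10.1, with $S_{kl}$ substituted
for $\Sigma_{kl}$, $k, l = 1, 2$. ∎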


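The sample canonical variates have unit sample variances

$$s_{\hat{U}_k,\hat{U}_k} = s_{\hat{V}_k,\hat{V}_k} = 1 \qquad (10\text{-}25)$$

and their sample correlations are

$$r_{\hat{U}_k,\hat{U}_\ell} = r_{\hat{V}_k,\hat{V}_\ell} = 0, \quad k \ne \ell$$
$$r_{\hat{U}_k,\hat{V}_\ell} = 0, \quad k \ne \ell \qquad (10\text{-}26)$$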
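The construction in Result 10.2 and the properties (10-25) and (10-26) can be sketched in a few lines. This is an illustrative sketch, assuming NumPy and $p \le q$; the function names `inv_sqrt` and `sample_cca` and the simulated data are hypothetical and are not taken from the text.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def sample_cca(X1, X2):
    """Sample canonical correlations and coefficients (assumes p <= q).

    Returns rho (length p), A (p x p) and B (p x q); row k of A and of B holds
    the coefficient vectors a_k' and b_k' of the kth sample canonical pair.
    """
    p = X1.shape[1]
    S = np.cov(np.hstack([X1, X2]), rowvar=False)    # divisor n - 1, as in (10-22)
    S11, S12, S22 = S[:p, :p], S[:p, p:], S[p:, p:]
    W1, W2 = inv_sqrt(S11), inv_sqrt(S22)
    M = W1 @ S12 @ np.linalg.inv(S22) @ S12.T @ W1
    vals, E = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]                   # largest eigenvalue first
    rho = np.sqrt(np.clip(vals[order], 0.0, None))
    E = E[:, order]
    F = (W2 @ S12.T @ W1 @ E) / rho                  # f_k = (1/rho_k) S22^{-1/2} S21 S11^{-1/2} e_k
    return rho, (W1 @ E).T, (W2 @ F).T

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 200
X1 = rng.standard_normal((n, 2))
X2 = X1 @ np.array([[0.8, 0.1, 0.0], [0.2, 0.5, 0.3]]) + rng.standard_normal((n, 3))

rho, A, B = sample_cca(X1, X2)
U, V = X1 @ A.T, X2 @ B.T                            # canonical variate scores
C = np.cov(np.hstack([U, V]), rowvar=False)
cov_U, cov_V, cov_UV = C[:2, :2], C[2:, 2:], C[:2, 2:]
print(np.round(cov_U, 6), np.round(cov_V, 6), np.round(cov_UV, 6), rho, sep="\n")
```

The sample covariance matrices of the scores should be identity matrices, and the cross-covariance matrix should be diagonal with the sample canonical correlations on its diagonal. The signs of individual coefficient vectors are not determined by the eigenanalysis, so other software may report a pair with both signs reversed.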


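The interpretation of $\hat{U}_k, \hat{V}_k$ is often aided by computing the sample correlations
between the canonical variates and the variables in the sets $X^{(1)}$ and $X^{(2)}$. We define
the matrices

$$\underset{(p\times p)}{\hat{A}} = [\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_p]' \qquad \underset{(q\times q)}{\hat{B}} = [\hat{b}_1, \hat{b}_2, \ldots, \hat{b}_q]' \qquad (10\text{-}27)$$

whose rows are the coefficient vectors for the sample canonical variates.³ Analogous
to (10-12), we have

$$\underset{(p\times 1)}{\hat{U}} = \hat{A}x^{(1)} \qquad \underset{(q\times 1)}{\hat{V}} = \hat{B}x^{(2)} \qquad (10\text{-}28)$$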


¹ When the distribution is normal, the maximum likelihood method can be employed using $\hat\Sigma = S_n$
in place of $S$. The sample canonical correlations $\hat\rho_k^*$ are, therefore, the maximum likelihood estimates of
$\rho_k^*$, and $\sqrt{n/(n-1)}\,\hat{a}_k$, $\sqrt{n/(n-1)}\,\hat{b}_k$ are the maximum likelihood estimates of $a_k$ and $b_k$, respectively.
² If $p > \operatorname{rank}(S_{12}) = p_1$, the nonzero sample canonical correlations are $\hat\rho_1^*, \ldots, \hat\rho_{p_1}^*$.
³ The vectors $\hat{b}_{p+1} = S_{22}^{-1/2}\hat{f}_{p+1}$, $\hat{b}_{p+2} = S_{22}^{-1/2}\hat{f}_{p+2}$, ..., $\hat{b}_q = S_{22}^{-1/2}\hat{f}_q$ are determined from a choice of
the last $q - p$ mutually orthogonal eigenvectors $\hat{f}$ associated with the zero eigenvalue of $S_{22}^{-1/2}S_{21}S_{11}^{-1}S_{12}S_{22}^{-1/2}$.




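and we can define

$R_{\hat{U},x^{(1)}}$ = matrix of sample correlations of $\hat{U}$ with $x^{(1)}$
$R_{\hat{V},x^{(2)}}$ = matrix of sample correlations of $\hat{V}$ with $x^{(2)}$
$R_{\hat{U},x^{(2)}}$ = matrix of sample correlations of $\hat{U}$ with $x^{(2)}$
$R_{\hat{V},x^{(1)}}$ = matrix of sample correlations of $\hat{V}$ with $x^{(1)}$

Corresponding to (10-19), we have

$$R_{\hat{U},x^{(1)}} = \hat{A}S_{11}D_{11}^{-1/2}$$
$$R_{\hat{V},x^{(2)}} = \hat{B}S_{22}D_{22}^{-1/2}$$
$$R_{\hat{U},x^{(2)}} = \hat{A}S_{12}D_{22}^{-1/2} \qquad (10\text{-}29)$$
$$R_{\hat{V},x^{(1)}} = \hat{B}S_{21}D_{11}^{-1/2}$$

where $D_{11}^{-1/2}$ is the $(p \times p)$ diagonal matrix with ith diagonal element
(sample $\operatorname{var}(x_i^{(1)}))^{-1/2}$ and $D_{22}^{-1/2}$ is the $(q \times q)$ diagonal matrix with ith diagonal element
(sample $\operatorname{var}(x_i^{(2)}))^{-1/2}$.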


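Comment. If the observations are standardized [see (8-25)], the data matrix
becomes

$$Z = [Z^{(1)} \mid Z^{(2)}] = \begin{bmatrix} z_1^{(1)\prime} & z_1^{(2)\prime} \\ \vdots & \vdots \\ z_n^{(1)\prime} & z_n^{(2)\prime} \end{bmatrix}$$

and the sample canonical variates become

$$\underset{(p\times 1)}{\hat{U}} = \hat{A}_z z^{(1)} \qquad \underset{(q\times 1)}{\hat{V}} = \hat{B}_z z^{(2)} \qquad (10\text{-}30)$$

where $\hat{A}_z = \hat{A}D_{11}^{1/2}$ and $\hat{B}_z = \hat{B}D_{22}^{1/2}$. The sample canonical correlations are
unaffected by the standardization. The correlations displayed in (10-29) remain unchanged
and may be calculated, for standardized observations, by substituting $\hat{A}_z$ for
$\hat{A}$, $\hat{B}_z$ for $\hat{B}$, and $R$ for $S$. Note that $D_{11}^{-1/2} = I$ $(p \times p)$ and $D_{22}^{-1/2} = I$ $(q \times q)$ for standardized
observations.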


Example 10.4 (Canonical correlation analysis of the chicken-bone data) In Example
9.14, data consisting of bone and skull measurements of white leghorn fowl were
described. From this example, the chicken-bone measurements for

Head $(X^{(1)})$: $X_1^{(1)}$ = skull length, $X_2^{(1)}$ = skull breadth

Leg $(X^{(2)})$: $X_1^{(2)}$ = femur length, $X_2^{(2)}$ = tibia length

have the sample correlation matrix

$$R = \begin{bmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{bmatrix} = \left[\begin{array}{cc|cc} 1.0 & .505 & .569 & .602 \\ .505 & 1.0 & .422 & .467 \\ \hline .569 & .422 & 1.0 & .926 \\ .602 & .467 & .926 & 1.0 \end{array}\right]$$

A canonical correlation analysis of the head and leg sets of variables
using $R$ produces the two canonical correlations and corresponding pairs of
variables

$$\hat\rho_1^* = .631 \qquad \begin{aligned} \hat{U}_1 &= .781z_1^{(1)} + .345z_2^{(1)} \\ \hat{V}_1 &= .060z_1^{(2)} + .944z_2^{(2)} \end{aligned}$$

and

$$\hat\rho_2^* = .057 \qquad \begin{aligned} \hat{U}_2 &= -.856z_1^{(1)} + 1.106z_2^{(1)} \\ \hat{V}_2 &= -2.648z_1^{(2)} + 2.475z_2^{(2)} \end{aligned}$$

Here $z_i^{(1)}$, $i = 1, 2$ and $z_i^{(2)}$, $i = 1, 2$ are the standardized data values for sets 1 and
2, respectively. The preceding results were taken from the SAS statistical software
output shown in Panel 10.1. In addition, the correlations of the original variables
with the canonical variables are highlighted in that panel. ∎
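The numbers in Example 10.4 can be reproduced directly from the correlation matrix. The sketch below assumes NumPy and uses the equivalent eigenproblem for $R_{11}^{-1}R_{12}R_{22}^{-1}R_{21}$ (the alternative calculation of Exercise 10.4) rather than the symmetric form in Result 10.2; variable names are illustrative. The overall sign of each coefficient vector is arbitrary, so a sign pattern opposite to the one printed in the example is equally valid.

```python
import numpy as np

# Sample correlation matrix of the chicken-bone data (head set first, leg set second).
R = np.array([[1.0, .505, .569, .602],
              [.505, 1.0, .422, .467],
              [.569, .422, 1.0, .926],
              [.602, .467, .926, 1.0]])
p = 2
R11, R12, R21, R22 = R[:p, :p], R[:p, p:], R[p:, :p], R[p:, p:]

# Eigenvalues of R11^{-1} R12 R22^{-1} R21 are the squared canonical correlations;
# its eigenvectors are proportional to the coefficient vectors a_k.
vals, vecs = np.linalg.eig(np.linalg.solve(R11, R12) @ np.linalg.solve(R22, R21))
order = np.argsort(vals.real)[::-1]
rho = np.sqrt(vals.real[order])

A, B = [], []
for k in order:
    a = vecs[:, k].real
    a = a / np.sqrt(a @ R11 @ a)           # scale so U_k has unit sample variance
    b = np.linalg.solve(R22, R21 @ a)      # proportional to b_k
    b = b / np.sqrt(b @ R22 @ b)           # scale so V_k has unit sample variance
    A.append(a)
    B.append(b)
A, B = np.array(A), np.array(B)            # rows are a_k' and b_k'

print(np.round(rho, 3))
print(np.round(A, 3))
print(np.round(B, 3))
```

Up to the sign of each row, the printed coefficients should agree with the rounded values in the example, and the canonical correlations with .631 and .057.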


Example 10.5 (Canonical correlation analysis of job satisfaction) As part of a larger 
study of the effects of organizational structure on “job satisfaction,” Dunham [4] in- 
vestigated the extent to which measures of job satisfaction are related to job charac- 
teristics. Using a survey instrument, Dunham obtained measurements of p = 5 job 
characteristics and q = 7 job satisfaction variables for n = 784 executives from the 
corporate branch of a large retail merchandising corporation. Are measures of job 
satisfaction associated with job characteristics? The answer may have implications 
for job design. 


PANEL 10.1 SAS ANALYSIS FOR EXAMPLE 10.4 USING PROC CANCORR.

PROGRAM COMMANDS

title 'Canonical Correlation Analysis';
data skull (type = corr);
  _type_ = 'CORR';
  input _name_ $ x1 x2 x3 x4;
cards;
x1 1.0   .     .     .
x2 .505  1.0   .     .
x3 .569  .422  1.0   .
x4 .602  .467  .926  1.0
;
proc cancorr data = skull vprefix = head wprefix = leg;
  var x1 x2; with x3 x4;




OUTPUT

Canonical Correlation Analysis

        Canonical      Adjusted Canonical   Approx           Squared Canonical
        Correlation    Correlation          Standard Error   Correlation
   1    0.631085       0.628291             0.036286         0.398268
   2    0.056794       .                    0.060108         0.003226

Raw Canonical Coefficient for the 'VAR' Variables

        HEAD1           HEAD2
   X1   0.7807924389    -0.855973184
   X2   0.3445068301    1.1061835145

Raw Canonical Coefficient for the 'WITH' Variables

        LEG1            LEG2
   X3   0.0602508775    -2.648156338
   X4   0.943948961     2.4749388913

Canonical Structure

Correlations Between the 'VAR' Variables and Their Canonical Variables (see 10-29)

        HEAD1     HEAD2
   X1   0.9548    -0.2974
   X2   0.7388    0.6739

Correlations Between the 'WITH' Variables and Their Canonical Variables (see 10-29)

        LEG1      LEG2
   X3   0.9343    -0.3564
   X4   0.9997    0.0227

Correlations Between the 'VAR' Variables
and the Canonical Variables of the 'WITH' Variables (see 10-29)

        LEG1      LEG2
   X1   0.6025    -0.0169
   X2   0.4663    0.0383

Correlations Between the 'WITH' Variables
and the Canonical Variables of the 'VAR' Variables (see 10-29)

        HEAD1     HEAD2
   X3   0.5897    -0.0202
   X4   0.6309    0.0013




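The original job characteristic variables, $X^{(1)}$, and job satisfaction variables,
$X^{(2)}$, were respectively defined as

$$X^{(1)} = \begin{bmatrix} X_1^{(1)} \\ X_2^{(1)} \\ X_3^{(1)} \\ X_4^{(1)} \\ X_5^{(1)} \end{bmatrix} = \begin{bmatrix} \text{feedback} \\ \text{task significance} \\ \text{task variety} \\ \text{task identity} \\ \text{autonomy} \end{bmatrix}$$

$$X^{(2)} = \begin{bmatrix} X_1^{(2)} \\ X_2^{(2)} \\ X_3^{(2)} \\ X_4^{(2)} \\ X_5^{(2)} \\ X_6^{(2)} \\ X_7^{(2)} \end{bmatrix} = \begin{bmatrix} \text{supervisor satisfaction} \\ \text{career-future satisfaction} \\ \text{financial satisfaction} \\ \text{workload satisfaction} \\ \text{company identification} \\ \text{kind-of-work satisfaction} \\ \text{general satisfaction} \end{bmatrix}$$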

Responses for variables $X^{(1)}$ and $X^{(2)}$ were recorded on a scale and then
standardized. The sample correlation matrix based on 784 responses is

$$R = \begin{bmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{bmatrix}$$

with rows (lower triangle shown for $R_{11}$ and $R_{22}$; the bar separates the two sets)

   X1(1)   1.0                      |  .33  .32  .20  .19  .30  .37  .21
   X2(1)   .49  1.0                 |  .30  .21  .16  .08  .27  .35  .20
   X3(1)   .53  .57  1.0            |  .31  .23  .14  .07  .24  .37  .18
   X4(1)   .49  .46  .48  1.0       |  .24  .22  .12  .19  .21  .29  .16
   [rows for X5(1) and X1(2) through X4(2) are illegible in the source]
   X5(2)   .30  .27  .24  .21  .32  |  .34  .54  .46  .28  1.0
   X6(2)   .37  .35  .37  .29  .36  |  .37  .32  .29  .30  .35  1.0
   X7(2)   .21  .20  .18  .16  .27  |  .40  .58  .45  .27  .59  .31  1.0

The min(p,q) = min(5,7) = 5 sample canonical correlations and the sample 
canonical variate coefficient vectors (from Dunham [4]) are displayed in the 
following table: 


[Table: Canonical Variate Coefficients and Canonical Correlations (standardized variables). The entries of this table are not legible in the source.]


For example, the first sample canonical variate pair is

$$\hat{U}_1 = .42z_1^{(1)} + .21z_2^{(1)} + .17z_3^{(1)} - .02z_4^{(1)} + .44z_5^{(1)}$$
$$\hat{V}_1 = .42z_1^{(2)} + .22z_2^{(2)} - .03z_3^{(2)} + .01z_4^{(2)} + .29z_5^{(2)} + .52z_6^{(2)} - .12z_7^{(2)}$$

with sample canonical correlation $\hat\rho_1^* = .55$.

According to the coefficients, $\hat{U}_1$ is primarily a feedback and autonomy
variable, while $\hat{V}_1$ represents supervisor, career-future, and kind-of-work satisfaction,
along with company identification.

To provide interpretations for $\hat{U}_1$ and $\hat{V}_1$, the sample correlations between $\hat{U}_1$
and its component variables and between $\hat{V}_1$ and its component variables were
computed. Also, the following table shows the sample correlations between variables in
one set and the first sample canonical variate of the other set. These correlations can
be calculated using (10-29).


Sample Correlations Between Original Variables and Canonical Variables

                             Sample canonical variates
   X(1) variables            U1      V1
   1. Feedback               .83     .46
   2. Task significance      .74     .41
   3. Task variety           .75     .42
   4. Task identity          .62     .34
   5. Autonomy               .85     .48

                                    Sample canonical variates
   X(2) variables                   U1      V1
   1. Supervisor satisfaction       .42     .75
   2. Career-future satisfaction    .35     .65
   3. Financial satisfaction        .21     .39
   4. Workload satisfaction         .21     .37
   5. Company identification        .36     .65
   6. Kind-of-work satisfaction     .44     .80
   7. General satisfaction          .28     .50


All five job characteristic variables have roughly the same correlations with the
first canonical variate $\hat{U}_1$. From this standpoint, $\hat{U}_1$ might be interpreted as a job
characteristic "index." This differs from the preferred interpretation, based on
coefficients, where the task variables are not important.

The other member of the first canonical variate pair, $\hat{V}_1$, seems to be representing,
primarily, supervisor satisfaction, career-future satisfaction, company identification,
and kind-of-work satisfaction. As the variables suggest, $\hat{V}_1$ might be regarded as
a job satisfaction-company identification index. This agrees with the preceding
interpretation based on the canonical coefficients of the $z_i^{(2)}$'s. The sample correlation
between the two indices $\hat{U}_1$ and $\hat{V}_1$ is $\hat\rho_1^* = .55$. There appears to be some overlap
between job characteristics and job satisfaction. We explore this issue further in
Example 10.7. ∎


Scatter plots of the first $(\hat{U}_1, \hat{V}_1)$ pair may reveal atypical observations $x_j$ requiring
further study. If the canonical correlations $\hat\rho_2^*, \hat\rho_3^*, \ldots$ are also moderately large,
scatter plots of the pairs $(\hat{U}_2, \hat{V}_2), (\hat{U}_3, \hat{V}_3), \ldots$ may also be helpful in this respect.
Many analysts suggest plotting "significant" canonical variates against their
component variables as an aid in subject-matter interpretation. These plots reinforce the
correlation coefficients in (10-29).

If the sample size is large, it is often desirable to split the sample in half. The
first half of the sample can be used to construct and evaluate the sample canonical
variates and canonical correlations. The results can then be "validated" with
the remaining observations. The change (if any) in the nature of the canonical
analysis will provide an indication of the sampling variability and the stability of
the conclusions.


10.5 Additional Sample Descriptive Measures 


If the canonical variates are "good" summaries of their respective sets of variables,
then the associations between variables can be described in terms of the canonical
variates and their correlations. It is useful to have summary measures of the extent
to which the canonical variates account for the variation in their respective sets. It is
also useful, on occasion, to calculate the proportion of variance in one set of
variables explained by the canonical variates of the other set.


Matrices of Errors of Approximations 


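Given the matrices $\hat{A}$ and $\hat{B}$ defined in (10-27), let $\hat{a}^{(i)}$ and $\hat{b}^{(i)}$ denote the ith column
of $\hat{A}^{-1}$ and $\hat{B}^{-1}$, respectively. Since $\hat{U} = \hat{A}x^{(1)}$ and $\hat{V} = \hat{B}x^{(2)}$, we can write

$$\underset{(p\times 1)}{x^{(1)}} = \underset{(p\times p)}{\hat{A}^{-1}}\,\underset{(p\times 1)}{\hat{U}} \qquad \underset{(q\times 1)}{x^{(2)}} = \underset{(q\times q)}{\hat{B}^{-1}}\,\underset{(q\times 1)}{\hat{V}} \qquad (10\text{-}31)$$

Because sample $\operatorname{Cov}(\hat{U}, \hat{V}) = \hat{A}S_{12}\hat{B}'$, sample $\operatorname{Cov}(\hat{U}) = \hat{A}S_{11}\hat{A}' = I$ $(p \times p)$, and
sample $\operatorname{Cov}(\hat{V}) = \hat{B}S_{22}\hat{B}' = I$ $(q \times q)$,

$$S_{12} = \hat{A}^{-1}\left[\begin{array}{cccc|c} \hat\rho_1^* & 0 & \cdots & 0 & \\ 0 & \hat\rho_2^* & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \\ 0 & 0 & \cdots & \hat\rho_p^* & \end{array}\right](\hat{B}^{-1})' = \hat\rho_1^*\hat{a}^{(1)}\hat{b}^{(1)\prime} + \hat\rho_2^*\hat{a}^{(2)}\hat{b}^{(2)\prime} + \cdots + \hat\rho_p^*\hat{a}^{(p)}\hat{b}^{(p)\prime}$$

$$S_{11} = (\hat{A}^{-1})(\hat{A}^{-1})' = \hat{a}^{(1)}\hat{a}^{(1)\prime} + \hat{a}^{(2)}\hat{a}^{(2)\prime} + \cdots + \hat{a}^{(p)}\hat{a}^{(p)\prime} \qquad (10\text{-}32)$$

$$S_{22} = (\hat{B}^{-1})(\hat{B}^{-1})' = \hat{b}^{(1)}\hat{b}^{(1)\prime} + \hat{b}^{(2)}\hat{b}^{(2)\prime} + \cdots + \hat{b}^{(q)}\hat{b}^{(q)\prime}$$

Since $x^{(1)} = \hat{A}^{-1}\hat{U}$ and $\hat{U}$ has sample covariance $I$, the first r columns of $\hat{A}^{-1}$
contain the sample covariances of the first r canonical variates $\hat{U}_1, \hat{U}_2, \ldots, \hat{U}_r$ with
their component variables $X_1^{(1)}, X_2^{(1)}, \ldots, X_p^{(1)}$. Similarly, the first r columns of $\hat{B}^{-1}$
contain the sample covariances of $\hat{V}_1, \hat{V}_2, \ldots, \hat{V}_r$ with their component variables.

If only the first r canonical pairs are used, so that for instance,

$$\tilde{x}^{(1)} = [\hat{a}^{(1)} \mid \hat{a}^{(2)} \mid \cdots \mid \hat{a}^{(r)}]\begin{bmatrix} \hat{U}_1 \\ \hat{U}_2 \\ \vdots \\ \hat{U}_r \end{bmatrix}$$

and (10-33)

$$\tilde{x}^{(2)} = [\hat{b}^{(1)} \mid \hat{b}^{(2)} \mid \cdots \mid \hat{b}^{(r)}]\begin{bmatrix} \hat{V}_1 \\ \hat{V}_2 \\ \vdots \\ \hat{V}_r \end{bmatrix}$$

then $S_{12}$ is approximated by sample $\operatorname{Cov}(\tilde{x}^{(1)}, \tilde{x}^{(2)})$.
Continuing, we see that the matrices of errors of approximation are

$$S_{11} - (\hat{a}^{(1)}\hat{a}^{(1)\prime} + \hat{a}^{(2)}\hat{a}^{(2)\prime} + \cdots + \hat{a}^{(r)}\hat{a}^{(r)\prime}) = \hat{a}^{(r+1)}\hat{a}^{(r+1)\prime} + \cdots + \hat{a}^{(p)}\hat{a}^{(p)\prime}$$

$$S_{22} - (\hat{b}^{(1)}\hat{b}^{(1)\prime} + \hat{b}^{(2)}\hat{b}^{(2)\prime} + \cdots + \hat{b}^{(r)}\hat{b}^{(r)\prime}) = \hat{b}^{(r+1)}\hat{b}^{(r+1)\prime} + \cdots + \hat{b}^{(q)}\hat{b}^{(q)\prime}$$

$$S_{12} - (\hat\rho_1^*\hat{a}^{(1)}\hat{b}^{(1)\prime} + \hat\rho_2^*\hat{a}^{(2)}\hat{b}^{(2)\prime} + \cdots + \hat\rho_r^*\hat{a}^{(r)}\hat{b}^{(r)\prime}) = \hat\rho_{r+1}^*\hat{a}^{(r+1)}\hat{b}^{(r+1)\prime} + \cdots + \hat\rho_p^*\hat{a}^{(p)}\hat{b}^{(p)\prime} \qquad (10\text{-}34)$$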


The approximation error matrices (10-34) may be interpreted as descriptive
summaries of how well the first r sample canonical variates reproduce the sample
covariance matrices. Patterns of large entries in the rows and/or columns of the
approximation error matrices indicate a poor "fit" to the corresponding variable(s).

Ordinarily, the first r variates do a better job of reproducing the elements of
$S_{12} = S_{21}'$ than the elements of $S_{11}$ or $S_{22}$. Mathematically, this occurs because the
residual matrix in the former case is directly related to the smallest $p - r$ sample
canonical correlations. These correlations are usually all close to zero. On the other
hand, the residual matrices associated with the approximations to the matrices $S_{11}$ and
$S_{22}$ depend only on the last $p - r$ and $q - r$ coefficient vectors. The elements in
these vectors may be relatively large, and hence, the residual matrices can have "large"
entries.

For standardized observations, $R_{kl}$ replaces $S_{kl}$ and $\hat{a}_z^{(k)}$, $\hat{b}_z^{(l)}$ replace $\hat{a}^{(k)}$, $\hat{b}^{(l)}$
in (10-34).


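Example 10.6 (Calculating matrices of errors of approximation) In Example 10.4, we
obtained the canonical correlations between the two head and the two leg variables
for white leghorn fowl. Starting with the sample correlation matrix

$$R = \begin{bmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{bmatrix} = \left[\begin{array}{cc|cc} 1.0 & .505 & .569 & .602 \\ .505 & 1.0 & .422 & .467 \\ \hline .569 & .422 & 1.0 & .926 \\ .602 & .467 & .926 & 1.0 \end{array}\right]$$

we obtained the two sets of canonical correlations and variables

$$\hat\rho_1^* = .631 \qquad \begin{aligned} \hat{U}_1 &= .781z_1^{(1)} + .345z_2^{(1)} \\ \hat{V}_1 &= .060z_1^{(2)} + .944z_2^{(2)} \end{aligned}$$

and

$$\hat\rho_2^* = .057 \qquad \begin{aligned} \hat{U}_2 &= -.856z_1^{(1)} + 1.106z_2^{(1)} \\ \hat{V}_2 &= -2.648z_1^{(2)} + 2.475z_2^{(2)} \end{aligned}$$

where $z_i^{(1)}$, $i = 1, 2$ and $z_i^{(2)}$, $i = 1, 2$ are the standardized data values for sets 1 and
2, respectively.

We first calculate (see Panel 10.1)

$$\hat{A}_z^{-1} = \begin{bmatrix} .781 & .345 \\ -.856 & 1.106 \end{bmatrix}^{-1} = \begin{bmatrix} .9548 & -.2974 \\ .7388 & .6739 \end{bmatrix}$$

$$\hat{B}_z^{-1} = \begin{bmatrix} .9343 & -.3564 \\ .9997 & .0227 \end{bmatrix}$$

Consequently, the matrices of errors of approximation created by using only the
first canonical pair are

$$R_{12} - \text{sample } \operatorname{Cov}(\tilde{z}^{(1)}, \tilde{z}^{(2)}) = (.057)\begin{bmatrix} -.2974 \\ .6739 \end{bmatrix}\begin{bmatrix} -.3564 & .0227 \end{bmatrix} = \begin{bmatrix} .006 & -.000 \\ -.014 & .001 \end{bmatrix}$$

$$R_{11} - \text{sample } \operatorname{Cov}(\tilde{z}^{(1)}) = \begin{bmatrix} -.2974 \\ .6739 \end{bmatrix}\begin{bmatrix} -.2974 & .6739 \end{bmatrix} = \begin{bmatrix} .088 & -.200 \\ -.200 & .454 \end{bmatrix}$$

$$R_{22} - \text{sample } \operatorname{Cov}(\tilde{z}^{(2)}) = \begin{bmatrix} -.3564 \\ .0227 \end{bmatrix}\begin{bmatrix} -.3564 & .0227 \end{bmatrix} = \begin{bmatrix} .127 & -.008 \\ -.008 & .001 \end{bmatrix}$$

where $\tilde{z}^{(1)}$, $\tilde{z}^{(2)}$ are given by (10-33) with $r = 1$ and $\hat{a}_z^{(1)}$, $\hat{b}_z^{(1)}$ replace $\hat{a}^{(1)}$, $\hat{b}^{(1)}$,
respectively.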


We see that the first pair of canonical variables effectively summarizes (reproduces)
the between-set correlations in $R_{12}$. However, the individual variates are not
particularly effective summaries of the sampling variability in the original $z^{(1)}$ and
$z^{(2)}$ sets, respectively. This is especially true for $\hat{U}_1$. ∎
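The error matrices of Example 10.6 follow mechanically from (10-34). The sketch below assumes NumPy and starts from the rounded coefficient matrices $\hat{A}_z$ and $\hat{B}_z$ and the rounded $\hat\rho_2^* = .057$ reported in the example, so its results agree with the printed ones only to about three decimals; variable names are illustrative.

```python
import numpy as np

# Rounded coefficient matrices from Example 10.4 (rows are the coefficient vectors).
A_z = np.array([[0.781, 0.345],
                [-0.856, 1.106]])
B_z = np.array([[0.060, 0.944],
                [-2.648, 2.475]])
rho2 = 0.057                                 # second sample canonical correlation

A_inv = np.linalg.inv(A_z)                   # columns are a_z^(1), a_z^(2)
B_inv = np.linalg.inv(B_z)                   # columns are b_z^(1), b_z^(2)

# With r = 1, (10-34) leaves only the terms built from the second columns.
err_R12 = rho2 * np.outer(A_inv[:, 1], B_inv[:, 1])
err_R11 = np.outer(A_inv[:, 1], A_inv[:, 1])
err_R22 = np.outer(B_inv[:, 1], B_inv[:, 1])

print(np.round(err_R12, 3))
print(np.round(err_R11, 3))
print(np.round(err_R22, 3))
```

The error matrix for $R_{12}$ is scaled by the small second canonical correlation, which is why it is so much closer to zero than the error matrices for $R_{11}$ and $R_{22}$.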


Proportions of Explained Sample Variance 


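When the observations are standardized, the sample covariance matrices $S_{kl}$ are
correlation matrices $R_{kl}$. The canonical coefficient vectors are the rows of the
matrices $\hat{A}_z$ and $\hat{B}_z$, and the columns of $\hat{A}_z^{-1}$ and $\hat{B}_z^{-1}$ are the sample correlations
between the canonical variates and their component variables.

Specifically,

$$\text{sample } \operatorname{Cov}(z^{(1)}, \hat{U}) = \text{sample } \operatorname{Cov}(\hat{A}_z^{-1}\hat{U}, \hat{U}) = \hat{A}_z^{-1}$$

and

$$\text{sample } \operatorname{Cov}(z^{(2)}, \hat{V}) = \text{sample } \operatorname{Cov}(\hat{B}_z^{-1}\hat{V}, \hat{V}) = \hat{B}_z^{-1}$$

so

$$\hat{A}_z^{-1} = [\hat{a}_z^{(1)}, \hat{a}_z^{(2)}, \ldots, \hat{a}_z^{(p)}] = \begin{bmatrix} r_{\hat{U}_1,z_1^{(1)}} & r_{\hat{U}_2,z_1^{(1)}} & \cdots & r_{\hat{U}_p,z_1^{(1)}} \\ r_{\hat{U}_1,z_2^{(1)}} & r_{\hat{U}_2,z_2^{(1)}} & \cdots & r_{\hat{U}_p,z_2^{(1)}} \\ \vdots & \vdots & & \vdots \\ r_{\hat{U}_1,z_p^{(1)}} & r_{\hat{U}_2,z_p^{(1)}} & \cdots & r_{\hat{U}_p,z_p^{(1)}} \end{bmatrix}$$

$$\hat{B}_z^{-1} = [\hat{b}_z^{(1)}, \hat{b}_z^{(2)}, \ldots, \hat{b}_z^{(q)}] = \begin{bmatrix} r_{\hat{V}_1,z_1^{(2)}} & r_{\hat{V}_2,z_1^{(2)}} & \cdots & r_{\hat{V}_q,z_1^{(2)}} \\ r_{\hat{V}_1,z_2^{(2)}} & r_{\hat{V}_2,z_2^{(2)}} & \cdots & r_{\hat{V}_q,z_2^{(2)}} \\ \vdots & \vdots & & \vdots \\ r_{\hat{V}_1,z_q^{(2)}} & r_{\hat{V}_2,z_q^{(2)}} & \cdots & r_{\hat{V}_q,z_q^{(2)}} \end{bmatrix} \qquad (10\text{-}35)$$

where $r_{\hat{U}_i,z_k^{(1)}}$ and $r_{\hat{V}_i,z_k^{(2)}}$ are the sample correlation coefficients between the quantities
with subscripts.

Using (10-32) with standardized observations, we obtain

Total (standardized) sample variance in first set
$$= \operatorname{tr}(R_{11}) = \operatorname{tr}(\hat{a}_z^{(1)}\hat{a}_z^{(1)\prime} + \hat{a}_z^{(2)}\hat{a}_z^{(2)\prime} + \cdots + \hat{a}_z^{(p)}\hat{a}_z^{(p)\prime}) = p \qquad (10\text{-}36a)$$

Total (standardized) sample variance in second set
$$= \operatorname{tr}(R_{22}) = \operatorname{tr}(\hat{b}_z^{(1)}\hat{b}_z^{(1)\prime} + \hat{b}_z^{(2)}\hat{b}_z^{(2)\prime} + \cdots + \hat{b}_z^{(q)}\hat{b}_z^{(q)\prime}) = q \qquad (10\text{-}36b)$$

Since the correlations in the first $r < p$ columns of $\hat{A}_z^{-1}$ and $\hat{B}_z^{-1}$ involve only the
sample canonical variates $\hat{U}_1, \hat{U}_2, \ldots, \hat{U}_r$ and $\hat{V}_1, \hat{V}_2, \ldots, \hat{V}_r$, respectively, we define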

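the contributions of the first r canonical variates to the total (standardized) sample
variances as

$$\operatorname{tr}(\hat{a}_z^{(1)}\hat{a}_z^{(1)\prime} + \hat{a}_z^{(2)}\hat{a}_z^{(2)\prime} + \cdots + \hat{a}_z^{(r)}\hat{a}_z^{(r)\prime}) = \sum_{i=1}^{r}\sum_{k=1}^{p} r^2_{\hat{U}_i,z_k^{(1)}}$$

and

$$\operatorname{tr}(\hat{b}_z^{(1)}\hat{b}_z^{(1)\prime} + \hat{b}_z^{(2)}\hat{b}_z^{(2)\prime} + \cdots + \hat{b}_z^{(r)}\hat{b}_z^{(r)\prime}) = \sum_{i=1}^{r}\sum_{k=1}^{q} r^2_{\hat{V}_i,z_k^{(2)}}$$

The proportions of total (standardized) sample variances "explained by" the first r
canonical variates then become

$$R^2_{z^{(1)}|\hat{U}_1,\hat{U}_2,\ldots,\hat{U}_r} = \begin{pmatrix}\text{proportion of total standardized} \\ \text{sample variance in first set} \\ \text{explained by } \hat{U}_1, \hat{U}_2, \ldots, \hat{U}_r\end{pmatrix} = \frac{\operatorname{tr}(\hat{a}_z^{(1)}\hat{a}_z^{(1)\prime} + \cdots + \hat{a}_z^{(r)}\hat{a}_z^{(r)\prime})}{\operatorname{tr}(R_{11})} = \frac{\sum_{i=1}^{r}\sum_{k=1}^{p} r^2_{\hat{U}_i,z_k^{(1)}}}{p}$$

and (10-37)

$$R^2_{z^{(2)}|\hat{V}_1,\hat{V}_2,\ldots,\hat{V}_r} = \begin{pmatrix}\text{proportion of total standardized} \\ \text{sample variance in second set} \\ \text{explained by } \hat{V}_1, \hat{V}_2, \ldots, \hat{V}_r\end{pmatrix} = \frac{\operatorname{tr}(\hat{b}_z^{(1)}\hat{b}_z^{(1)\prime} + \cdots + \hat{b}_z^{(r)}\hat{b}_z^{(r)\prime})}{\operatorname{tr}(R_{22})} = \frac{\sum_{i=1}^{r}\sum_{k=1}^{q} r^2_{\hat{V}_i,z_k^{(2)}}}{q}$$

Descriptive measures (10-37) provide some indication of how well the canonical
variates represent their respective sets. They provide single-number descriptions
of the matrices of errors. In particular,

$$\frac{1}{p}\operatorname{tr}[R_{11} - \hat{a}_z^{(1)}\hat{a}_z^{(1)\prime} - \hat{a}_z^{(2)}\hat{a}_z^{(2)\prime} - \cdots - \hat{a}_z^{(r)}\hat{a}_z^{(r)\prime}] = 1 - R^2_{z^{(1)}|\hat{U}_1,\hat{U}_2,\ldots,\hat{U}_r}$$

$$\frac{1}{q}\operatorname{tr}[R_{22} - \hat{b}_z^{(1)}\hat{b}_z^{(1)\prime} - \hat{b}_z^{(2)}\hat{b}_z^{(2)\prime} - \cdots - \hat{b}_z^{(r)}\hat{b}_z^{(r)\prime}] = 1 - R^2_{z^{(2)}|\hat{V}_1,\hat{V}_2,\ldots,\hat{V}_r}$$

according to (10-36) and (10-37).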




Example 10.7 (Calculating proportions of sample variance explained by canonical
variates) Consider the job characteristic-job satisfaction data discussed in
Example 10.5. Using the table of sample correlation coefficients presented in that
example, we find that

$$R^2_{z^{(1)}|\hat{U}_1} = \frac{1}{5}\sum_{k=1}^{5} r^2_{\hat{U}_1,z_k^{(1)}} = \frac{1}{5}[(.83)^2 + (.74)^2 + \cdots + (.85)^2] = .58$$

$$R^2_{z^{(2)}|\hat{V}_1} = \frac{1}{7}\sum_{k=1}^{7} r^2_{\hat{V}_1,z_k^{(2)}} = \frac{1}{7}[(.75)^2 + (.65)^2 + \cdots + (.50)^2] = .37$$

The first sample canonical variate $\hat{U}_1$ of the job characteristics set accounts for 58%
of the set's total sample variance. The first sample canonical variate $\hat{V}_1$ of the job
satisfaction set explains 37% of the set's total sample variance. We might thus infer
that $\hat{U}_1$ is a "better" representative of its set than $\hat{V}_1$ is of its set. The interested
reader may wish to see how well $\hat{U}_1$ and $\hat{V}_1$ reproduce the correlation matrices $R_{11}$ and
$R_{22}$, respectively. [See (10-29).] ∎
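The arithmetic of Example 10.7 is an average of squared correlations, as (10-37) with $r = 1$ shows. A minimal sketch, assuming NumPy, with the correlations typed in from the table of Example 10.5 (the array names are illustrative):

```python
import numpy as np

# Sample correlations of the first canonical variates with their own sets
# (feedback, task significance, task variety, task identity, autonomy).
r_U1 = np.array([0.83, 0.74, 0.75, 0.62, 0.85])
# (supervisor, career-future, financial, workload, company identification,
#  kind-of-work, general satisfaction).
r_V1 = np.array([0.75, 0.65, 0.39, 0.37, 0.65, 0.80, 0.50])

# (10-37) with r = 1: sum of squared correlations divided by the set size.
R2_z1_given_U1 = np.mean(r_U1 ** 2)
R2_z2_given_V1 = np.mean(r_V1 ** 2)
print(round(R2_z1_given_U1, 2), round(R2_z2_given_V1, 2))
```

The complements, roughly .42 and .63, are the average diagonal entries of the corresponding error matrices for $R_{11}$ and $R_{22}$.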


10.6 Large Sample Inferences 
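When $\Sigma_{12} = 0$, $a'X^{(1)}$ and $b'X^{(2)}$ have covariance $a'\Sigma_{12}b = 0$ for all vectors $a$ and
$b$. Consequently, all the canonical correlations must be zero, and there is no point in
pursuing a canonical correlation analysis. The next result provides a way of testing
$\Sigma_{12} = 0$, for large samples.

Result 10.3. Let

$$X_j = \begin{bmatrix} X_j^{(1)} \\ X_j^{(2)} \end{bmatrix}, \quad j = 1, 2, \ldots, n$$

be a random sample from an $N_{p+q}(\mu, \Sigma)$ population with

$$\Sigma = \begin{bmatrix} \underset{(p\times p)}{\Sigma_{11}} & \underset{(p\times q)}{\Sigma_{12}} \\ \underset{(q\times p)}{\Sigma_{21}} & \underset{(q\times q)}{\Sigma_{22}} \end{bmatrix}$$

Then the likelihood ratio test of $H_0: \Sigma_{12} = 0$ versus $H_1: \Sigma_{12} \ne 0$ (both $p \times q$) rejects $H_0$ for
large values of

$$-2\ln\Lambda = n\ln\left(\frac{|S_{11}|\,|S_{22}|}{|S|}\right) = -n\ln\prod_{i=1}^{p}(1 - \hat\rho_i^{*2}) \qquad (10\text{-}38)$$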

where

$$S = \begin{bmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{bmatrix}$$

is the unbiased estimator of $\Sigma$. For large n, the test statistic (10-38) is approximately
distributed as a chi-square random variable with $pq$ d.f.

Proof. See Kshirsagar [8]. ∎

The likelihood ratio statistic (10-38) compares the sample generalized variance
under $H_0$, namely,

$$\left|\begin{matrix} S_{11} & 0 \\ 0' & S_{22} \end{matrix}\right| = |S_{11}|\,|S_{22}|$$

with the unrestricted generalized variance $|S|$.

Bartlett [3] suggests replacing the multiplicative factor n in the likelihood
ratio statistic with the factor $n - 1 - \frac{1}{2}(p + q + 1)$ to improve the $\chi^2$ approximation
to the sampling distribution of $-2\ln\Lambda$. Thus, for n and $n - (p + q)$
large, we

Reject $H_0: \Sigma_{12} = 0$ $(\rho_1^* = \rho_2^* = \cdots = \rho_p^* = 0)$ at significance level $\alpha$ if

$$-\left(n - 1 - \tfrac{1}{2}(p + q + 1)\right)\ln\prod_{i=1}^{p}(1 - \hat\rho_i^{*2}) > \chi^2_{pq}(\alpha) \qquad (10\text{-}39)$$

where $\chi^2_{pq}(\alpha)$ is the upper $(100\alpha)$th percentile of a chi-square distribution with
$pq$ d.f.

If the null hypothesis $H_0: \Sigma_{12} = 0$ $(\rho_1^* = \rho_2^* = \cdots = \rho_p^* = 0)$ is rejected, it is natural
to examine the "significance" of the individual canonical correlations. Since the
canonical correlations are ordered from the largest to the smallest, we can begin by
assuming that the first canonical correlation is nonzero and the remaining $p - 1$
canonical correlations are zero. If this hypothesis is rejected, we assume that the first
two canonical correlations are nonzero, but the remaining $p - 2$ canonical correlations
are zero, and so forth.

Let the implied sequence of hypotheses be

$$H_0^{(k)}: \rho_1^* \ne 0, \rho_2^* \ne 0, \ldots, \rho_k^* \ne 0, \rho_{k+1}^* = \cdots = \rho_p^* = 0$$
$$H_1^{(k)}: \rho_i^* \ne 0, \text{ for some } i \ge k + 1 \qquad (10\text{-}40)$$


Bartlett [2] has argued that the kth hypothesis in (10-40) can be tested by the
likelihood ratio criterion. Specifically,

Reject $H_0^{(k)}$ at significance level $\alpha$ if

$$-\left(n - 1 - \tfrac{1}{2}(p + q + 1)\right)\ln\prod_{i=k+1}^{p}(1 - \hat\rho_i^{*2}) > \chi^2_{(p-k)(q-k)}(\alpha) \qquad (10\text{-}41)$$

where $\chi^2_{(p-k)(q-k)}(\alpha)$ is the upper $(100\alpha)$th percentile of a chi-square distribution
with $(p - k)(q - k)$ d.f. We point out that the test statistic in (10-41) involves
$\prod_{i=k+1}^{p}(1 - \hat\rho_i^{*2})$, the "residual" after the first k sample canonical correlations have
been removed from the total criterion $\Lambda^{2/n} = \prod_{i=1}^{p}(1 - \hat\rho_i^{*2})$.

If the members of the sequence $H_0, H_0^{(1)}, H_0^{(2)}$, and so forth, are tested one at
a time until $H_0^{(k)}$ is not rejected for some k, the overall significance level is not $\alpha$
and, in fact, would be difficult to determine. Another defect of this procedure is the
tendency it induces to conclude that a null hypothesis is correct simply because it is
not rejected.

To summarize, the overall test of significance in Result 10.3 is useful for multi- 
variate normal data. The sequential tests implied by (10-41) should be interpreted 
with caution and are, perhaps, best regarded as rough guides for selecting the num- 
ber of important canonical variates. 


Example 10.8 (Testing the significance of the canonical correlations for the job
satisfaction data) Test the significance of the canonical correlations exhibited by the job
characteristics-job satisfaction data introduced in Example 10.5.

All the test statistics of immediate interest are summarized in the table of test
results that follows. From Example 10.5, $n = 784$, $p = 5$, $q = 7$, $\hat\rho_1^* = .55$, $\hat\rho_2^* = .23$, $\hat\rho_3^* = .12$,
$\hat\rho_4^* = .08$, and $\hat\rho_5^* = .05$.

Assuming multivariate normal data, we find that the first two canonical correlations,
$\rho_1^*$ and $\rho_2^*$, appear to be nonzero, although with the very large sample size,
small deviations from zero will show up as statistically significant. From a practical
point of view, the second (and subsequent) sample canonical correlations can probably
be ignored, since (1) they are reasonably small in magnitude and (2) the corresponding
canonical variates explain very little of the sample variation in the variable
sets $X^{(1)}$ and $X^{(2)}$. ∎
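The Bartlett-corrected statistics in (10-39) and (10-41) are simple to compute once the sample canonical correlations are known. The sketch below uses only the Python standard library; the function name `bartlett_statistics` is a hypothetical helper, and the 1% critical values are typed in from a chi-square table rather than computed.

```python
import math

def bartlett_statistics(rho, n, p, q):
    """Return (k, statistic, df) for k = 0, 1, ..., p - 1.

    k = 0 is the overall test (10-39); k >= 1 tests H0^(k) as in (10-41).
    """
    factor = n - 1 - 0.5 * (p + q + 1)
    out = []
    for k in range(p):
        residual = math.prod(1.0 - r * r for r in rho[k:])
        out.append((k, -factor * math.log(residual), (p - k) * (q - k)))
    return out

# Job characteristics-job satisfaction data of Example 10.5.
rho_hat = [0.55, 0.23, 0.12, 0.08, 0.05]
results = bartlett_statistics(rho_hat, n=784, p=5, q=7)

# Upper 1% points of chi-square, looked up for 35, 24 and 15 d.f.
critical = {35: 57.34, 24: 42.98, 15: 30.58}
for k, stat, df in results[:3]:
    decision = "reject" if stat > critical[df] else "do not reject"
    print(f"k = {k}: statistic = {stat:.1f}, d.f. = {df}, {decision} H0")
```

With these inputs the first two hypotheses are rejected at the 1% level and the third is not, which matches the conclusion stated in the example. As the text cautions, the sequence of tests does not have overall level $\alpha$.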


The distribution theory associated with the sample canonical correlations and
the sample canonical variate coefficients is extremely complex (apart from the
$p = 1$ and $q = 1$ situations), even in the null case, $\Sigma_{12} = 0$. The reader interested in
the distribution theory is referred to Kshirsagar [8].


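Test Results

1. Null hypothesis $H_0: \Sigma_{12} = 0$ (all $\rho_i^* = 0$)
   Observed test statistic (Bartlett correction):
   $-\left(n - 1 - \tfrac{1}{2}(p + q + 1)\right)\ln\prod_{i=1}^{5}(1 - \hat\rho_i^{*2}) = -\left(784 - 1 - \tfrac{1}{2}(5 + 7 + 1)\right)\ln(.6453) = 340.1$
   Degrees of freedom: $pq = 5(7) = 35$
   Upper 1% point of $\chi^2$ distribution: $\chi^2_{35}(.01) = 57$
   Conclusion: Reject $H_0$.

2. Null hypothesis $H_0^{(1)}: \rho_1^* \ne 0$, $\rho_2^* = \cdots = \rho_5^* = 0$
   Observed test statistic (Bartlett correction):
   $-\left(n - 1 - \tfrac{1}{2}(p + q + 1)\right)\ln\prod_{i=2}^{5}(1 - \hat\rho_i^{*2}) = 60.4$
   Degrees of freedom: $(p - 1)(q - 1) = 24$
   Upper 1% point of $\chi^2$ distribution: $\chi^2_{24}(.01) = 42.98$
   Conclusion: Reject $H_0$.

3. Null hypothesis $H_0^{(2)}: \rho_1^* \ne 0$, $\rho_2^* \ne 0$, $\rho_3^* = \cdots = \rho_5^* = 0$
   Observed test statistic (Bartlett correction):
   $-\left(n - 1 - \tfrac{1}{2}(p + q + 1)\right)\ln\prod_{i=3}^{5}(1 - \hat\rho_i^{*2}) = 18.2$
   Degrees of freedom: $(p - 2)(q - 2) = 15$
   Upper 1% point of $\chi^2$ distribution: $\chi^2_{15}(.01) = 30.58$
   Conclusion: Do not reject $H_0$.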


Exercises 


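10.1. Consider the covariance matrix given in Example 10.3:

$$\operatorname{Cov}\begin{bmatrix} X_1^{(1)} \\ X_2^{(1)} \\ X_1^{(2)} \\ X_2^{(2)} \end{bmatrix} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} = \left[\begin{array}{cc|cc} 100 & 0 & 0 & 0 \\ 0 & 1 & .95 & 0 \\ \hline 0 & .95 & 1 & 0 \\ 0 & 0 & 0 & 100 \end{array}\right]$$

Verify that the first pair of canonical variates are $U_1 = X_2^{(1)}$, $V_1 = X_1^{(2)}$ with canonical
correlation $\rho_1^* = .95$.

10.2. The $(2 \times 1)$ random vectors $X^{(1)}$ and $X^{(2)}$ have the joint mean vector and joint
covariance matrix

[joint mean vector and joint covariance matrix are illegible in the source]

(a) Calculate the canonical correlations $\rho_1^*, \rho_2^*$.
(b) Determine the canonical variate pairs $(U_1, V_1)$ and $(U_2, V_2)$.
(c) Let $U = [U_1, U_2]'$ and $V = [V_1, V_2]'$. From first principles, evaluate

$$E\begin{bmatrix} U \\ V \end{bmatrix} \quad \text{and} \quad \operatorname{Cov}\begin{bmatrix} U \\ V \end{bmatrix} = \begin{bmatrix} \Sigma_{UU} & \Sigma_{UV} \\ \Sigma_{VU} & \Sigma_{VV} \end{bmatrix}$$

Compare your results with the properties in Result 10.1.

10.3. Let $Z^{(1)} = V_{11}^{-1/2}(X^{(1)} - \mu^{(1)})$ and $Z^{(2)} = V_{22}^{-1/2}(X^{(2)} - \mu^{(2)})$ be two sets of standardized
variables. If $\rho_1^*, \rho_2^*, \ldots, \rho_p^*$ are the canonical correlations for the $X^{(1)}$, $X^{(2)}$ sets and
$(U_i, V_i) = (a_i'X^{(1)}, b_i'X^{(2)})$, $i = 1, 2, \ldots, p$, are the associated canonical variates, determine
the canonical correlations and canonical variates for the $Z^{(1)}$, $Z^{(2)}$ sets. That is,
express the canonical correlations and canonical variate coefficient vectors for the $Z^{(1)}$,
$Z^{(2)}$ sets in terms of those for the $X^{(1)}$, $X^{(2)}$ sets.

10.4. (Alternative calculation of canonical correlations and variates.) Show that, if $\lambda_i$ is an
eigenvalue of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}$ with associated eigenvector $e_i$, then $\lambda_i$ is also an
eigenvalue of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ with eigenvector $\Sigma_{11}^{-1/2}e_i$.
Hint: $|\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2} - \lambda_i I| = 0$ implies that

$$0 = |\Sigma_{11}^{-1/2}|\,|\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2} - \lambda_i I|\,|\Sigma_{11}^{1/2}| = |\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \lambda_i I|$$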


10.5. Use the information in Example 10.1.

(a) Find the eigenvalues of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ and verify that these eigenvalues are the
same as the eigenvalues of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}$.

(b) Determine the second pair of canonical variates $(U_2, V_2)$ and verify, from first
principles, that their correlation is the second canonical correlation $\rho_2^* = .03$.


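10.6. Show that the canonical correlations are invariant under nonsingular linear
transformations of the $X^{(1)}$, $X^{(2)}$ variables of the form $\underset{(p\times p)}{C}\,\underset{(p\times 1)}{X^{(1)}}$ and $\underset{(q\times q)}{D}\,\underset{(q\times 1)}{X^{(2)}}$.

Hint: Consider $\operatorname{Cov}\begin{bmatrix} CX^{(1)} \\ DX^{(2)} \end{bmatrix} = \begin{bmatrix} C\Sigma_{11}C' & C\Sigma_{12}D' \\ D\Sigma_{21}C' & D\Sigma_{22}D' \end{bmatrix}$. Consider any linear
combination $a_1'(CX^{(1)}) = a'X^{(1)}$ with $a' = a_1'C$. Similarly, consider $b_1'(DX^{(2)}) = b'X^{(2)}$
with $b' = b_1'D$. The choices $a_1' = e'\Sigma_{11}^{-1/2}C^{-1}$ and $b_1' = f'\Sigma_{22}^{-1/2}D^{-1}$ give the maximum
correlation.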


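10.7. Let $\rho_{12} = \begin{bmatrix} \rho & \rho \\ \rho & \rho \end{bmatrix}$ and $\rho_{11} = \rho_{22} = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$, corresponding to the equal correlation
structure where $X^{(1)}$ and $X^{(2)}$ each have two components.
(a) Determine the canonical variates corresponding to the nonzero canonical correlation.
(b) Generalize the results in Part a to the case where $X^{(1)}$ has p components and $X^{(2)}$
has $q \ge p$ components.
Hint: $\rho_{12} = \rho\,\mathbf{1}\mathbf{1}'$, where $\mathbf{1}$ is a $(p \times 1)$ column vector of 1's and $\mathbf{1}'$ is a $(1 \times q)$ row
vector of 1's. Note that $\rho_{11}\mathbf{1} = [1 + (p - 1)\rho]\mathbf{1}$, so $\rho_{11}^{-1/2}\mathbf{1} = (1 + (p - 1)\rho)^{-1/2}\mathbf{1}$.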
10.8. (Correlation for angular measurement.) Some observations, such as wind direction, are in
the form of angles. An angle $\theta_2$ can be represented as the pair $X^{(2)} = [\cos(\theta_2), \sin(\theta_2)]'$.
(a) Show that $b'X^{(2)} = \sqrt{b_1^2 + b_2^2}\,\cos(\theta_2 - \beta)$, where $b_1/\sqrt{b_1^2 + b_2^2} = \cos(\beta)$ and
$b_2/\sqrt{b_1^2 + b_2^2} = \sin(\beta)$.
Hint: $\cos(\theta_2 - \beta) = \cos(\theta_2)\cos(\beta) + \sin(\theta_2)\sin(\beta)$.
(b) Let $X^{(1)}$ have a single component $X_1^{(1)}$. Show that the single canonical correlation is
$\rho_1^* = \max_{\beta}\operatorname{Corr}(X_1^{(1)}, \cos(\theta_2 - \beta))$. Selecting the canonical variable $V_1$ amounts to
selecting a new origin $\beta$ for the angle $\theta_2$. (See Johnson and Wehrly [7].)
(c) Let $X_1^{(1)}$ be ozone (in parts per million) and $\theta_2$ = wind direction measured from the
north. Nineteen observations made in downtown Milwaukee, Wisconsin, give the
sample correlation matrix of ozone, $\cos(\theta_2)$, and $\sin(\theta_2)$:

[entries of the 3 × 3 sample correlation matrix are illegible in the source]

Find the sample canonical correlation $\hat\rho_1^*$ and the canonical variate $\hat{V}_1$ representing
the new origin $\hat\beta$.
(d) Suppose $X^{(1)}$ is also angular measurements of the form $X^{(1)} = [\cos(\theta_1), \sin(\theta_1)]'$.
Then $a'X^{(1)} = \sqrt{a_1^2 + a_2^2}\,\cos(\theta_1 - \alpha)$. Show that

$$\rho_1^* = \max_{\alpha,\beta}\operatorname{Corr}(\cos(\theta_1 - \alpha), \cos(\theta_2 - \beta))$$


10.9. 


10.10. 


Exercises 569 


(e) Twenty-one observations on the 6:00 A.M. and noon wind directions give the correla- 
tion matrix 


cos(6,) sin(4,) , cos(42) sin(62) 


1.0 ~.291) 440 372 
Rw | ne OO 208243 
- 440 -.205; 1.0 181 


372.243} 1811.0 


Find the sample canonical correlation p* and U;, V;. 
The following exercises may require a computer. 


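10.9. H. Hotelling [5] reports that $n = 140$ seventh-grade children received four tests
on $X_1^{(1)}$ = reading speed, $X_2^{(1)}$ = reading power, $X_1^{(2)}$ = arithmetic speed, and
$X_2^{(2)}$ = arithmetic power. The correlations for performance are

$$\mathbf{R} = \begin{bmatrix} \mathbf{R}_{11} & \mathbf{R}_{12} \\ \mathbf{R}_{21} & \mathbf{R}_{22} \end{bmatrix} = \left[\begin{array}{rr|rr} 1.0 & .6328 & .2412 & .0586 \\ .6328 & 1.0 & -.0553 & .0655 \\ \hline .2412 & -.0553 & 1.0 & .4248 \\ .0586 & .0655 & .4248 & 1.0 \end{array}\right]$$

(a) Find all the sample canonical correlations and the sample canonical variates.
(b) Stating any assumptions you make, test the hypotheses
$H_0\colon \boldsymbol{\Sigma}_{12} = \boldsymbol{\rho}_{12} = \mathbf{0} \quad (\rho_1^* = \rho_2^* = 0)$
$H_1\colon \boldsymbol{\Sigma}_{12} = \boldsymbol{\rho}_{12} \ne \mathbf{0}$
at the $\alpha = .05$ level of significance. If $H_0$ is rejected, test
$H_0^{(1)}\colon \rho_1^* \ne 0,\ \rho_2^* = 0$
$H_1^{(1)}\colon \rho_2^* \ne 0$
with a significance level of $\alpha = .05$. Does reading ability (as measured by the two
tests) correlate with arithmetic ability (as measured by the two tests)? Discuss.
(c) Evaluate the matrices of approximation errors for $\mathbf{R}_{11}$, $\mathbf{R}_{22}$, and $\mathbf{R}_{12}$ determined by
the first sample canonical variate pair $\hat{U}_1$, $\hat{V}_1$.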


10.10. In a study of poverty, crime, and deterrence, Parker and Smith [10] report certain
summary crime statistics in various states for the years 1970 and 1973. A portion of their
sample correlation matrix is

$$\mathbf{R} = \begin{bmatrix} \mathbf{R}_{11} & \mathbf{R}_{12} \\ \mathbf{R}_{21} & \mathbf{R}_{22} \end{bmatrix} = \left[\begin{array}{rr|rr} 1.0 & .615 & -.111 & -.266 \\ .615 & 1.0 & .195 & -.085 \\ \hline -.111 & .195 & 1.0 & -.269 \\ -.266 & -.085 & -.269 & 1.0 \end{array}\right]$$

The variables are

$X_1^{(1)}$ = 1973 nonprimary homicides
$X_2^{(1)}$ = 1973 primary homicides (homicides involving family or acquaintances)
$X_1^{(2)}$ = 1970 severity of punishment (median months served)
$X_2^{(2)}$ = 1970 certainty of punishment (number of admissions to prison divided by
number of homicides)

(a) Find the sample canonical correlations.

(b) Determine the first canonical pair $\hat{U}_1$, $\hat{V}_1$ and interpret these quantities.
10.11. Example 8.5 presents the correlation matrix obtained from $n = 103$ successive
weekly rates of return for five stocks. Perform a canonical correlation analysis with
$\mathbf{X}^{(1)} = [X_1^{(1)}, X_2^{(1)}, X_3^{(1)}]'$, the rates of return for the banks, and $\mathbf{X}^{(2)} = [X_1^{(2)}, X_2^{(2)}]'$,
the rates of return for the oil companies.
10.12. A random sample of $n = 70$ families will be surveyed to determine the association
between certain “demographic” variables and certain “consumption” variables.
Let

Criterion set:
$X_1^{(1)}$ = annual frequency of dining at a restaurant
$X_2^{(1)}$ = annual frequency of attending movies

Predictor set:
$X_1^{(2)}$ = age of head of household
$X_2^{(2)}$ = annual family income
$X_3^{(2)}$ = educational level of head of household


37 1.0 
21 35 10 


(a) Determine the sample canonical correlations, and test the hypothesis $H_0\colon \boldsymbol{\Sigma}_{12} = \mathbf{0}$
(or, equivalently, $\boldsymbol{\rho}_{12} = \mathbf{0}$) at the $\alpha = .05$ level. If $H_0$ is rejected, test for the
significance ($\alpha = .05$) of the first canonical correlation.

(b) Using standardized variables, construct the canonical variates corresponding to the
“significant” canonical correlation(s).

(c) Using the results in Parts a and b, prepare a table showing the canonical variate
coefficients (for “significant” canonical correlations) and the sample correlations of
the canonical variates with their component variables.

(d) Given the information in (c), interpret the canonical variates.

(e) Do the demographic variables have something to say about the consumption
variables? Do the consumption variables provide much information about the
demographic variables?


10.13. Waugh [12] provides information about $n = 138$ samples of Canadian hard red spring
wheat and the flour made from the samples. The $p = 5$ wheat measurements (in
standardized form) were

$z_1^{(1)}$ = kernel texture
$z_2^{(1)}$ = test weight
$z_3^{(1)}$ = damaged kernels
$z_4^{(1)}$ = foreign material
$z_5^{(1)}$ = crude protein in the wheat

The $q = 4$ (standardized) flour measurements were

$z_1^{(2)}$ = wheat per barrel of flour
$z_2^{(2)}$ = ash in flour
$z_3^{(2)}$ = crude protein in flour
$z_4^{(2)}$ = gluten quality index


The sample correlation matrix was

$$\mathbf{R} = \begin{bmatrix} \mathbf{R}_{11} & \mathbf{R}_{12} \\ \mathbf{R}_{21} & \mathbf{R}_{22} \end{bmatrix}$$

 -.479  -.419   .361   .461  -.505 |  .251   1.0
  .780   .542  -.546  -.393   .737 | -.490  -.434   1.0
 -.152  -.102   .172  -.019  -.148 |  .250  -.079  -.163   1.0


(a) Find the sample canonical variates corresponding to significant (at the $\alpha = .01$
level) canonical correlations.

(b) Interpret the first sample canonical variates $\hat{U}_1$, $\hat{V}_1$. Do they in some sense represent
the overall quality of the wheat and flour, respectively?

(c) What proportion of the total sample variance of the first set $\mathbf{Z}^{(1)}$ is explained by the
canonical variate $\hat{U}_1$? What proportion of the total sample variance of the $\mathbf{Z}^{(2)}$ set is
explained by the canonical variate $\hat{V}_1$? Discuss your answers.
10.14. Consider the correlation matrix of profitability measures given in Exercise 9.15. Let
$\mathbf{X}^{(1)} = [X_1^{(1)}, X_2^{(1)}, \ldots, X_6^{(1)}]'$ be the vector of variables representing accounting measures
of profitability, and let $\mathbf{X}^{(2)} = [X_1^{(2)}, X_2^{(2)}]'$ be the vector of variables representing the
two market measures of profitability. Partition the sample correlation matrix accordingly,
and perform a canonical correlation analysis. Specifically,

(a) Determine the first sample canonical variates $\hat{U}_1$, $\hat{V}_1$ and their correlation. Interpret
these canonical variates.

(b) Let $\mathbf{Z}^{(1)}$ and $\mathbf{Z}^{(2)}$ be the sets of standardized variables corresponding to $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$,
respectively. What proportion of the total sample variance of $\mathbf{Z}^{(1)}$ is explained by
the canonical variate $\hat{U}_1$? What proportion of the total sample variance of $\mathbf{Z}^{(2)}$ is
explained by the canonical variate $\hat{V}_1$? Discuss your answers.

10.15. Observations on four measures of stiffness are given in Table 4.3 and discussed in
Example 4.14. Use the data in the table to construct the sample covariance matrix $\mathbf{S}$. Let
$\mathbf{X}^{(1)} = [X_1^{(1)}, X_2^{(1)}]'$ be the vector of variables representing the dynamic measures of stiffness
(shock wave, vibration), and let $\mathbf{X}^{(2)} = [X_1^{(2)}, X_2^{(2)}]'$ be the vector of variables representing
the static measures of stiffness. Perform a canonical correlation analysis of these data.


10.16. Andrews and Herzberg [1] give data obtained from a study of a comparison of
nondiabetic and diabetic patients. Three primary variables,

$X_1^{(1)}$ = glucose intolerance
$X_2^{(1)}$ = insulin response to oral glucose
$X_3^{(1)}$ = insulin resistance

and two secondary variables,

$X_1^{(2)}$ = relative weight
$X_2^{(2)}$ = fasting plasma glucose

were measured. The data for $n = 46$ nondiabetic patients yield the covariance matrix

$$\mathbf{S} = \begin{bmatrix} \mathbf{S}_{11} & \mathbf{S}_{12} \\ \mathbf{S}_{21} & \mathbf{S}_{22} \end{bmatrix} = \left[\begin{array}{rrr|rr} 1106.000 & 396.700 & 108.400 & .787 & 26.230 \\ 396.700 & 2382.000 & 1143.000 & -.214 & -23.960 \\ 108.400 & 1143.000 & 2136.000 & 2.189 & -20.840 \\ \hline .787 & -.214 & 2.189 & .016 & .216 \\ 26.230 & -23.960 & -20.840 & .216 & 70.560 \end{array}\right]$$

Determine the sample canonical variates and their correlations. Interpret these quantities.
Are the first canonical variates good summary measures of their respective sets of
variables? Explain. Test for the significance of the canonical relations with $\alpha = .05$.

10.17. Data concerning a person’s desire to smoke and psychological and physical state were
collected for $n = 110$ subjects. The data were responses, coded 1 to 5, to each of 12
questions (variables). The four standardized measurements related to the desire to smoke
are defined as

$z_1^{(1)}$ = smoking 1 (first wording)
$z_2^{(1)}$ = smoking 2 (second wording)
$z_3^{(1)}$ = smoking 3 (third wording)
$z_4^{(1)}$ = smoking 4 (fourth wording)

The eight standardized measurements related to the psychological and physical state are
given by

$z_1^{(2)}$ = concentration
$z_2^{(2)}$ = annoyance
$z_3^{(2)}$ = sleepiness
$z_4^{(2)}$ = tenseness
$z_5^{(2)}$ = alertness
$z_6^{(2)}$ = irritability
$z_7^{(2)}$ = tiredness
$z_8^{(2)}$ = contentedness


The correlation matrix constructed from the data is 




where z 


Ru=| 319 816 1.000 .845 
775 813 845 1.000 

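$$\mathbf{R}_{12} = \mathbf{R}_{21}' = \begin{bmatrix} .086 & .144 & .140 & .222 & .101 & .189 & .199 & .239 \\ .200 & .119 & .211 & .301 & .223 & .221 & .274 & .235 \\ .041 & .060 & .126 & .120 & .039 & .108 & .139 & .100 \\ .228 & .122 & .277 & .214 & .201 & .156 & .271 & .171 \end{bmatrix}$$

$$\mathbf{R}_{22} = \begin{bmatrix} 1.000 & .562 & .457 & .579 & .802 & .595 & .512 & .492 \\ .562 & 1.000 & .360 & .705 & .578 & .796 & .413 & .739 \\ .457 & .360 & 1.000 & .273 & .606 & .337 & .798 & .240 \\ .579 & .705 & .273 & 1.000 & .594 & .725 & .364 & .711 \\ .802 & .578 & .606 & .594 & 1.000 & .605 & .698 & .605 \\ .595 & .796 & .337 & .725 & .605 & 1.000 & .428 & .697 \\ .512 & .413 & .798 & .364 & .698 & .428 & 1.000 & .394 \\ .492 & .739 & .240 & .711 & .605 & .697 & .394 & 1.000 \end{bmatrix}$$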


Determine the sample canonical variates and their correlations. Interpret these quantities.
Are the first canonical variates good summary measures of their respective sets of
variables? Explain.


10.18. The data in Table 7.7 contain measurements on characteristics of pulp fibers and the
paper made from them. To correspond with the notation in this chapter, let the paper
characteristics be

$x_1^{(1)}$ = breaking length
$x_2^{(1)}$ = elastic modulus
$x_3^{(1)}$ = stress at failure
$x_4^{(1)}$ = burst strength

and the pulp fiber characteristics be

$x_1^{(2)}$ = arithmetic fiber length
$x_2^{(2)}$ = long fiber fraction
$x_3^{(2)}$ = fine fiber fraction
$x_4^{(2)}$ = zero span tensile

Determine the sample canonical variates and their correlations. Are the first canonical
variates good summary measures of their respective sets of variables? Explain. Test for
the significance of the canonical relations with $\alpha = .05$. Interpret the significant
canonical variables.

10.19. Refer to the correlation matrix for the Olympic decathlon results in Example 9.6. Obtain
the canonical correlations between the results for the running speed events (100-meter
run, 400-meter run, long jump) and the arm strength events (discus, javelin, shot put).
Recall that the signs of standardized running events values were reversed so that large
scores are best for all events.




References

1. Andrews, D. F., and A. M. Herzberg. Data. New York: Springer-Verlag, 1985.
2. Bartlett, M. S. “Further Aspects of the Theory of Multiple Regression.” Proceedings of
the Cambridge Philosophical Society, 34 (1938), 33-40.
3. Bartlett, M. S. “A Note on Tests of Significance in Multivariate Analysis.” Proceedings of
the Cambridge Philosophical Society, 35 (1939), 180-185.
4. Dunham, R. B. “Reaction to Job Characteristics: Moderating Effects of the Organization.”
Academy of Management Journal, 20, no. 1 (1977), 42-65.
5. Hotelling, H. “The Most Predictable Criterion.” Journal of Educational Psychology, 26
(1935), 139-142.
6. Hotelling, H. “Relations between Two Sets of Variables.” Biometrika, 28 (1936), 321-377.
7. Johnson, R. A., and T. Wehrly. “Measures and Models for Angular Correlation and
Angular-Linear Correlation.” Journal of the Royal Statistical Society (B), 39 (1977),
222-229.
8. Kshirsagar, A. M. Multivariate Analysis. New York: Marcel Dekker, Inc., 1972.
9. Lawley, D. N. “Tests of Significance in Canonical Analysis.” Biometrika, 46 (1959), 59-66.
10. Parker, R. N., and M. D. Smith. “Deterrence, Poverty, and Type of Homicide.” American
Journal of Sociology, 85 (1979), 614-624.
11. Rencher, A. C. “Interpretation of Canonical Discriminant Functions, Canonical Variates,
and Principal Components.” The American Statistician, 46 (1992), 217-225.


Chapter 11


DISCRIMINATION AND CLASSIFICATION


11.1 Introduction


Discrimination and classification are multivariate techniques concerned with 
separating distinct sets of objects (or observations) and with allocating new objects 
(observations) to previously defined groups. Discriminant analysis is rather 
exploratory in nature. As a separative procedure, it is often employed on a one-time 
basis in order to investigate observed differences when causal relationships are not 
well understood. Classification procedures are less exploratory in the sense that 
they lead to well-defined rules, which can be used for assigning new objects. Classi- 
fication ordinarily requires more problem structure than discrimination does. 

Thus, the immediate goals of discrimination and classification, respectively, are 
as follows: 


Goal 1. To describe, either graphically (in three or fewer dimensions) or alge- 
braically, the differential features of objects (observations) from sever- 
al known collections (populations). We try to find “discriminants” 
whose numerical values are such that the collections are separated as 
much as possible. 


Goal 2. To sort objects (observations) into two or more labeled classes. The
emphasis is on deriving a rule that can be used to optimally assign new
objects to the labeled classes.


We shall follow convention and use the term discrimination to refer to Goal 1. 
This terminology was introduced by R.A. Fisher [10] in the first modern treatment 
of separative problems. A more descriptive term for this goal, however, is separa- 
tion. We shall refer to the second goal as classification or allocation. 

A function that separates objects may sometimes serve as an allocator, and, 
conversely, a rule that allocates objects may suggest a discriminatory procedure. In 
practice, Goals 1 and 2 frequently overlap, and the distinction between separation 
and allocation becomes blurred. 




11.2 Separation and Classification for Two Populations 


To fix ideas, let us list situations in which one may be interested in (1) separating two
classes of objects or (2) assigning a new object to one of two classes (or both). It is
convenient to label the classes $\pi_1$ and $\pi_2$. The objects are ordinarily separated or
classified on the basis of measurements on, for instance, $p$ associated random
variables $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$. The observed values of $\mathbf{X}$ differ to some extent from
one class to the other.¹ We can think of the totality of values from the first class as
being the population of $\mathbf{x}$ values for $\pi_1$ and those from the second class as the
population of $\mathbf{x}$ values for $\pi_2$. These two populations can then be described by
probability density functions $f_1(\mathbf{x})$ and $f_2(\mathbf{x})$, and consequently, we can talk of assigning
observations to populations or objects to classes interchangeably.

You may recall that some of the examples of the following separation-classification
situations were introduced in Chapter 1.


Populations $\pi_1$ and $\pi_2$ | Measured variables $\mathbf{X}$
1. Solvent and distressed property-liability insurance companies. | Total assets, cost of stocks and bonds, market value of stocks and bonds, loss expenses, surplus, amount of premiums written.
2. Nonulcer dyspeptics (those with upset stomach problems) and controls (“normal”). | Measures of anxiety, dependence, guilt, perfectionism.
3. Federalist Papers written by James Madison and those written by Alexander Hamilton. | Frequencies of different words and lengths of sentences.
4. Two species of chickweed. | Sepal and petal length, petal cleft depth, bract length, scarious tip length, pollen diameter.
5. Purchasers of a new product and laggards (those “slow” to purchase). | Education, income, family size, amount of previous brand switching.
6. Successful or unsuccessful (fail to graduate) college students. | Entrance examination scores, high school grade-point average, number of high school activities.
7. Males and females. | Anthropological measurements, like circumference and volume on ancient skulls.
8. Good and poor credit risks. | Income, age, number of credit cards, family size.
9. Alcoholics and nonalcoholics. | Activity of monoamine oxidase enzyme, activity of adenylate cyclase enzyme.


We see from item 5, for example, that objects (consumers) are to be separated
into two labeled classes (“purchasers” and “laggards”) on the basis of observed
values of presumably relevant variables (education, income, and so forth). In the
terminology of observation and population, we want to identify an observation of
the form $\mathbf{x}' = [x_1(\text{education}), x_2(\text{income}), x_3(\text{family size}), x_4(\text{amount of brand switching})]$
as population $\pi_1$, purchasers, or population $\pi_2$, laggards.

¹If the values of $\mathbf{X}$ were not very different for objects in $\pi_1$ and $\pi_2$, there would be no problem;
that is, the classes would be indistinguishable, and new objects could be assigned to either class
indiscriminately.

At this point, we shall concentrate on classification for two populations, return- 
ing to separation in Section 11.3. 

Allocation or classification rules are usually developed from “learning” samples.
Measured characteristics of randomly selected objects known to come from
each of the two populations are examined for differences. Essentially, the set of all
possible sample outcomes is divided into two regions, $R_1$ and $R_2$, such that if a new
observation falls in $R_1$, it is allocated to population $\pi_1$, and if it falls in $R_2$, we
allocate it to population $\pi_2$. Thus, one set of observed values favors $\pi_1$, while the other
set of values favors $\pi_2$.

You may wonder at this point how it is we know that some observations belong 
to a particular population, but we are unsure about others. (This, of course, is what 
makes classification a problem!) Several conditions can give rise to this apparent 
anomaly (see [20]): 


1. Incomplete knowledge of future performance. 


Examples: In the past, extreme values of certain financial variables were ob- 
served 2 years prior to a firm’s subsequent bankruptcy. Classifying another firm 
as sound or distressed on the basis of observed values of these leading indicators 
may allow the officers to take corrective action, if necessary, before it is too late. 


A medical school applications office might want to classify an applicant as
likely to become an M.D. or unlikely to become an M.D. on the basis of test scores and
other college records. Here the actual determination can be made only at the 
end of several years of training. 


2. “Perfect” information requires destroying the object. 


Example: The lifetime of a calculator battery is determined by using it until
it fails, and the strength of a piece of lumber is obtained by loading it until it 
breaks. Failed products cannot be sold. One would like to classify products as 
good or bad (not meeting specifications) on the basis of certain preliminary 
measurements. 


3. Unavailable or expensive information. 


Examples: It is assumed that certain of the Federalist Papers were written by 
James Madison or Alexander Hamilton because they signed them. Others of the 
Papers, however, were unsigned and it is of interest to determine which of the 
two men wrote the unsigned Papers. Clearly, we cannot ask them. Word fre- 
quencies and sentence lengths may help classify the disputed Papers. 

Many medical problems can be identified conclusively only by conducting 
an expensive operation. Usually, one would like to diagnose an illness from eas- 
ily observed, yet potentially fallible, external symptoms. This approach helps 
avoid needless—and expensive—operations. 


It should be clear from these examples that classification rules cannot usually
provide an error-free method of assignment. This is because there may not be a
clear distinction between the measured characteristics of the populations; that is,
the groups may overlap. It is then possible, for example, to incorrectly classify a $\pi_2$
object as belonging to $\pi_1$ or a $\pi_1$ object as belonging to $\pi_2$.


Example 11.1 (Discriminating owners from nonowners of riding mowers) Consider
two groups in a city: $\pi_1$, riding-mower owners, and $\pi_2$, those without riding mowers
(that is, nonowners). In order to identify the best sales prospects for an intensive sales
campaign, a riding-mower manufacturer is interested in classifying families as
prospective owners or nonowners on the basis of $x_1$ = income and $x_2$ = lot size.
Random samples of $n_1 = 12$ current owners and $n_2 = 12$ current nonowners yield
the values in Table 11.1.


Table 11.1

        $\pi_1$: Riding-mower owners            $\pi_2$: Nonowners
  $x_1$ (Income     $x_2$ (Lot size       $x_1$ (Income     $x_2$ (Lot size
  in $1000s)        in 1000 ft²)          in $1000s)        in 1000 ft²)
      90.0              18.4                 105.0              19.6
     115.5              16.8                  82.8              20.8
      94.8              21.6                  94.8              17.2
      91.5              20.8                  73.2              20.4
     117.0              23.6                 114.0              17.6
     140.1              19.2                  79.2              17.6
     138.0              17.6                  89.4              16.0
     112.8              22.4                  96.0              18.4
      99.0              20.0                  77.4              16.4
     123.0              20.8                  63.0              18.8
      81.0              22.0                  81.0              14.0
     111.0              20.0                  93.0              14.8


These data are plotted in Figure 11.1. We see that riding-mower owners tend to
have larger incomes and bigger lots than nonowners, although income seems to be a
better “discriminator” than lot size. On the other hand, there is some overlap
between the two groups. If, for example, we were to allocate those values of $(x_1, x_2)$
that fall into region $R_1$ (as determined by the solid line in the figure) to $\pi_1$, mower
owners, and those $(x_1, x_2)$ values which fall into $R_2$ to $\pi_2$, nonowners, we would
make some mistakes. Some riding-mower owners would be incorrectly classified as
nonowners and, conversely, some nonowners as owners. The idea is to create a rule
(regions $R_1$ and $R_2$) that minimizes the chances of making these mistakes. (See
Exercise 11.2.) ■


A good classification procedure should result in few misclassifications. In other 
words, the chances, or probabilities, of misclassification should be small. As we shall 
see, there are additional features that an “optimal” classification rule should possess. 

It may be that one class or population has a greater likelihood of occurrence
than another because one of the two populations is relatively much larger than the
other. For example, there tend to be more financially sound firms than bankrupt
firms. As another example, one species of chickweed may be more prevalent than
another. An optimal classification rule should take these “prior probabilities of
occurrence” into account. If we really believe that the (prior) probability of a
financially distressed and ultimately bankrupted firm is very small, then one should
classify a randomly selected firm as nonbankrupt unless the data overwhelmingly
favors bankruptcy.

Figure 11.1 Income and lot size for riding-mower owners and nonowners. (Horizontal axis:
income in thousands of dollars; vertical axis: lot size in thousands of square feet.)

Another aspect of classification is cost. Suppose that classifying a $\pi_1$ object as
belonging to $\pi_2$ represents a more serious error than classifying a $\pi_2$ object as
belonging to $\pi_1$. Then one should be cautious about making the former assignment. As
an example, failing to diagnose a potentially fatal illness is substantially more “costly”
than concluding that the disease is present when, in fact, it is not. An optimal
classification procedure should, whenever possible, account for the costs associated
with misclassification.

Let $f_1(\mathbf{x})$ and $f_2(\mathbf{x})$ be the probability density functions associated with the
$p \times 1$ vector random variable $\mathbf{X}$ for the populations $\pi_1$ and $\pi_2$, respectively. An
object with associated measurements $\mathbf{x}$ must be assigned to either $\pi_1$ or $\pi_2$. Let $\Omega$ be
the sample space, that is, the collection of all possible observations $\mathbf{x}$. Let $R_1$ be that
set of $\mathbf{x}$ values for which we classify objects as $\pi_1$ and $R_2 = \Omega - R_1$ be the remaining
$\mathbf{x}$ values for which we classify objects as $\pi_2$. Since every object must be assigned to
one and only one of the two populations, the sets $R_1$ and $R_2$ are mutually exclusive
and exhaustive. For $p = 2$, we might have a case like the one pictured in Figure 11.2.

The conditional probability, $P(2 \mid 1)$, of classifying an object as $\pi_2$ when, in fact,
it is from $\pi_1$ is

$$P(2 \mid 1) = P(\mathbf{X} \in R_2 \mid \pi_1) = \int_{R_2 = \Omega - R_1} f_1(\mathbf{x})\, d\mathbf{x} \tag{11-1}$$

Similarly, the conditional probability, $P(1 \mid 2)$, of classifying an object as $\pi_1$ when it
is really from $\pi_2$ is

$$P(1 \mid 2) = P(\mathbf{X} \in R_1 \mid \pi_2) = \int_{R_1} f_2(\mathbf{x})\, d\mathbf{x} \tag{11-2}$$


Figure 11.2 Classification regions for two populations.


The integral sign in (11-1) represents the volume formed by the density function
$f_1(\mathbf{x})$ over the region $R_2$. Similarly, the integral sign in (11-2) represents the volume
formed by $f_2(\mathbf{x})$ over the region $R_1$. This is illustrated in Figure 11.3 for the
univariate case, $p = 1$.

Let $p_1$ be the prior probability of $\pi_1$ and $p_2$ be the prior probability of $\pi_2$,
where $p_1 + p_2 = 1$. Then the overall probabilities of correctly or incorrectly
classifying objects can be derived as the product of the prior and conditional
classification probabilities:

$$\begin{aligned}
P(\text{observation is correctly classified as } \pi_1) &= P(\text{observation comes from } \pi_1 \text{ and is correctly classified as } \pi_1) \\
&= P(\mathbf{X} \in R_1 \mid \pi_1)P(\pi_1) = P(1 \mid 1)\,p_1 \\
P(\text{observation is misclassified as } \pi_1) &= P(\text{observation comes from } \pi_2 \text{ and is misclassified as } \pi_1) \\
&= P(\mathbf{X} \in R_1 \mid \pi_2)P(\pi_2) = P(1 \mid 2)\,p_2 \\
P(\text{observation is correctly classified as } \pi_2) &= P(\text{observation comes from } \pi_2 \text{ and is correctly classified as } \pi_2) \\
&= P(\mathbf{X} \in R_2 \mid \pi_2)P(\pi_2) = P(2 \mid 2)\,p_2 \\
P(\text{observation is misclassified as } \pi_2) &= P(\text{observation comes from } \pi_1 \text{ and is misclassified as } \pi_2) \\
&= P(\mathbf{X} \in R_2 \mid \pi_1)P(\pi_1) = P(2 \mid 1)\,p_1
\end{aligned} \tag{11-3}$$

Figure 11.3 Misclassification probabilities for hypothetical classification regions
when $p = 1$. (The region $R_1$ is labeled “Classify as $\pi_1$” and $R_2$ “Classify as $\pi_2$”; the
shaded areas are $P(2 \mid 1) = \int_{R_2} f_1(x)\,dx$ and $P(1 \mid 2) = \int_{R_1} f_2(x)\,dx$.)
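The four probabilities in (11-3) can be checked numerically in the univariate case. The following Python sketch assumes, purely for illustration, that the two densities are normal with a common standard deviation and that $R_1$ is the half-line at or above a cutoff; the function names, the means, the cutoff, and the prior $p_1 = .6$ are all hypothetical choices and do not come from the text.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a univariate normal."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def classification_probabilities(cutoff, mean1, mean2, sd, p1):
    """Joint probabilities of (11-3) for R1 = {x >= cutoff}, R2 = {x < cutoff}.

    Assumes (hypothetically) normal densities with mean1 > mean2 and equal sd.
    """
    p2 = 1.0 - p1
    p_2_given_1 = normal_cdf(cutoff, mean1, sd)        # integral of f1 over R2
    p_1_given_2 = 1.0 - normal_cdf(cutoff, mean2, sd)  # integral of f2 over R1
    return {
        "correct_1": (1.0 - p_2_given_1) * p1,
        "misclassified_as_1": p_1_given_2 * p2,
        "correct_2": (1.0 - p_1_given_2) * p2,
        "misclassified_as_2": p_2_given_1 * p1,
    }


probs = classification_probabilities(cutoff=0.0, mean1=1.0, mean2=-1.0, sd=1.0, p1=0.6)
for name, value in probs.items():
    print(f"{name}: {value:.4f}")
```

Because every observation falls in exactly one of the four cases, the four values should sum to 1, which is a quick sanity check on any choice of regions.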


Classification schemes are often evaluated in terms of their misclassification
probabilities (see Section 11.4), but this ignores misclassification cost. For example,
even a seemingly small probability such as $.06 = P(2 \mid 1)$ may be too large if the cost
of making an incorrect assignment to $\pi_2$ is extremely high. A rule that ignores costs
may cause problems.

The costs of misclassification can be defined by a cost matrix:

                               Classify as:
                               $\pi_1$          $\pi_2$
True population:   $\pi_1$     0                $c(2 \mid 1)$
                   $\pi_2$     $c(1 \mid 2)$    0                    (11-4)

The costs are (1) zero for correct classification, (2) $c(1 \mid 2)$ when an observation from
$\pi_2$ is incorrectly classified as $\pi_1$, and (3) $c(2 \mid 1)$ when a $\pi_1$ observation is
incorrectly classified as $\pi_2$.

For any rule, the average, or expected cost of misclassification (ECM) is provided
by multiplying the off-diagonal entries in (11-4) by their probabilities of
occurrence, obtained from (11-3). Consequently,

$$\text{ECM} = c(2 \mid 1)P(2 \mid 1)\,p_1 + c(1 \mid 2)P(1 \mid 2)\,p_2 \tag{11-5}$$

A reasonable classification rule should have an ECM as small, or nearly as
small, as possible.
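As a small numerical illustration of (11-5), the sketch below evaluates the ECM for one set of inputs. The function name and all of the numbers are hypothetical, chosen only to show how the two weighted error terms combine.

```python
def expected_cost_of_misclassification(c21, c12, p_2_given_1, p_1_given_2, p1):
    """ECM of (11-5): c(2|1) P(2|1) p1 + c(1|2) P(1|2) p2, with p2 = 1 - p1."""
    p2 = 1.0 - p1
    return c21 * p_2_given_1 * p1 + c12 * p_1_given_2 * p2


# Hypothetical inputs: costs of 5 and 10 units, error rates .06 and .10, p1 = .8.
ecm = expected_cost_of_misclassification(
    c21=5.0, c12=10.0, p_2_given_1=0.06, p_1_given_2=0.10, p1=0.8
)
print(round(ecm, 4))  # 5(.06)(.8) + 10(.10)(.2) = .24 + .20 = .44
```

Comparing the ECM of two candidate rules computed this way is one simple way to see how a rare but expensive error can outweigh a more frequent but cheap one.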


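Result 11.1. The regions $R_1$ and $R_2$ that minimize the ECM are defined by the
values $\mathbf{x}$ for which the following inequalities hold:

$$R_1\colon\quad \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge \left(\frac{c(1 \mid 2)}{c(2 \mid 1)}\right)\left(\frac{p_2}{p_1}\right) \qquad \left(\text{density ratio}\right) \ge \left(\text{cost ratio}\right)\left(\text{prior probability ratio}\right)$$

$$R_2\colon\quad \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} < \left(\frac{c(1 \mid 2)}{c(2 \mid 1)}\right)\left(\frac{p_2}{p_1}\right) \qquad \left(\text{density ratio}\right) < \left(\text{cost ratio}\right)\left(\text{prior probability ratio}\right) \tag{11-6}$$

Proof. See Exercise 11.3. ■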


It is clear from (11-6) that the implementation of the minimum ECM rule
requires (1) the density function ratio evaluated at a new observation $\mathbf{x}_0$, (2) the cost
ratio, and (3) the prior probability ratio. The appearance of ratios in the definition of
the optimal classification regions is significant. Often, it is much easier to specify the
ratios than their component parts.

For example, it may be difficult to specify the costs (in appropriate units) of 
classifying a student as college material when, in fact, he or she is not and classifying 
a student as not college material, when, in fact, he or she is. The cost to taxpayers of
educating a college dropout for 2 years, for instance, can be roughly assessed. The 
cost to the university and society of not educating a capable student is more difficult 
to determine. However, it may be that a realistic number for the ratio of these mis- 
classification costs can be obtained. Whatever the units of measurement, not admit- 
ting a prospective college graduate may be five times more costly, over a suitable 
time horizon, than admitting an eventual dropout. In this case, the cost ratio is five. 

It is interesting to consider the classification regions defined in (11-6) for some 
special cases. 


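Special Cases of Minimum Expected Cost Regions

(a) $p_2/p_1 = 1$ (equal prior probabilities)

$$R_1\colon\ \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge \frac{c(1 \mid 2)}{c(2 \mid 1)} \qquad R_2\colon\ \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} < \frac{c(1 \mid 2)}{c(2 \mid 1)}$$

(b) $c(1 \mid 2)/c(2 \mid 1) = 1$ (equal misclassification costs)

$$R_1\colon\ \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge \frac{p_2}{p_1} \qquad R_2\colon\ \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} < \frac{p_2}{p_1} \tag{11-7}$$

(c) $p_2/p_1 = c(1 \mid 2)/c(2 \mid 1) = 1$ or $p_2/p_1 = 1/(c(1 \mid 2)/c(2 \mid 1))$
(equal prior probabilities and equal misclassification costs)

$$R_1\colon\ \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge 1 \qquad R_2\colon\ \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} < 1$$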


When the prior probabilities are unknown, they are often taken to be equal, and
the minimum ECM rule involves comparing the ratio of the population densities to
the ratio of the appropriate misclassification costs. If the misclassification cost ratio
is indeterminate, it is usually taken to be unity, and the population density ratio is
compared with the ratio of the prior probabilities. (Note that the prior probabilities
are in the reverse order of the densities.) Finally, when both the prior probability
and misclassification cost ratios are unity, or one ratio is the reciprocal of the
other, the optimal classification regions are determined simply by comparing the
values of the density functions. In this case, if $\mathbf{x}_0$ is a new observation and
$f_1(\mathbf{x}_0)/f_2(\mathbf{x}_0) \ge 1$ (that is, $f_1(\mathbf{x}_0) \ge f_2(\mathbf{x}_0)$), we assign $\mathbf{x}_0$ to $\pi_1$. On the other hand,
if $f_1(\mathbf{x}_0)/f_2(\mathbf{x}_0) < 1$, or $f_1(\mathbf{x}_0) < f_2(\mathbf{x}_0)$, we assign $\mathbf{x}_0$ to $\pi_2$.

It is common practice to arbitrarily use case (c) in (11-7) for classification. This
is tantamount to assuming equal prior probabilities and equal misclassification costs
for the minimum ECM rule.²

²This is the justification generally provided. It is also equivalent to assuming the prior probability
ratio to be the reciprocal of the misclassification cost ratio.


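Example 11.2 (Classifying a new observation into one of the two populations) A
researcher has enough data available to estimate the density functions $f_1(\mathbf{x})$ and $f_2(\mathbf{x})$
associated with populations $\pi_1$ and $\pi_2$, respectively. Suppose $c(2 \mid 1) = 5$ units and
$c(1 \mid 2) = 10$ units. In addition, it is known that about 20% of all objects (for which
the measurements $\mathbf{x}$ can be recorded) belong to $\pi_2$. Thus, the prior probabilities are
$p_1 = .8$ and $p_2 = .2$.

Given the prior probabilities and costs of misclassification, we can use (11-6) to
derive the classification regions $R_1$ and $R_2$. Specifically, we have

$$R_1\colon\ \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge \left(\frac{10}{5}\right)\left(\frac{.2}{.8}\right) = .5 \qquad R_2\colon\ \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} < \left(\frac{10}{5}\right)\left(\frac{.2}{.8}\right) = .5$$

Suppose the density functions evaluated at a new observation $\mathbf{x}_0$ give $f_1(\mathbf{x}_0) = .3$
and $f_2(\mathbf{x}_0) = .4$. Do we classify the new observation as $\pi_1$ or $\pi_2$? To answer the
question, we form the ratio

$$\frac{f_1(\mathbf{x}_0)}{f_2(\mathbf{x}_0)} = \frac{.3}{.4} = .75$$

and compare it with .5 obtained before. Since

$$\frac{f_1(\mathbf{x}_0)}{f_2(\mathbf{x}_0)} = .75 > \left(\frac{c(1 \mid 2)}{c(2 \mid 1)}\right)\left(\frac{p_2}{p_1}\right) = .5$$

we find that $\mathbf{x}_0 \in R_1$ and classify it as belonging to $\pi_1$. ■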
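The arithmetic of Example 11.2 can be reproduced with a few lines of Python. This is only a sketch of the minimum ECM rule in (11-6); the function name `min_ecm_classify` and its argument names are hypothetical, while the numerical inputs are those given in the example.

```python
def min_ecm_classify(f1_x0, f2_x0, c12, c21, p1, p2):
    """Apply the minimum ECM rule (11-6) at a single new observation.

    Returns the assigned population (1 or 2), the density ratio,
    and the threshold (cost ratio times prior probability ratio).
    """
    threshold = (c12 / c21) * (p2 / p1)
    ratio = f1_x0 / f2_x0
    population = 1 if ratio >= threshold else 2
    return population, ratio, threshold


# Inputs from Example 11.2: c(2|1) = 5, c(1|2) = 10, p1 = .8, p2 = .2,
# f1(x0) = .3 and f2(x0) = .4.
population, ratio, threshold = min_ecm_classify(
    f1_x0=0.3, f2_x0=0.4, c12=10.0, c21=5.0, p1=0.8, p2=0.2
)
print(population, round(ratio, 2), round(threshold, 2))
```

Setting `c12 = c21` or `p1 = p2` in the call gives the special cases (a) and (b) of (11-7) without any change to the function.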


Criteria other than the expected cost of misclassification can be used to
derive “optimal” classification procedures. For example, one might ignore the costs
of misclassification and choose $R_1$ and $R_2$ to minimize the total probability of
misclassification (TPM):

$$\begin{aligned}
\text{TPM} &= P(\text{misclassifying a } \pi_1 \text{ observation or misclassifying a } \pi_2 \text{ observation}) \\
&= P(\text{observation comes from } \pi_1 \text{ and is misclassified}) + P(\text{observation comes from } \pi_2 \text{ and is misclassified}) \\
&= p_1 \int_{R_2} f_1(\mathbf{x})\, d\mathbf{x} + p_2 \int_{R_1} f_2(\mathbf{x})\, d\mathbf{x}
\end{aligned} \tag{11-8}$$

Mathematically, this problem is equivalent to minimizing the expected cost of
misclassification when the costs of misclassification are equal. Consequently, the
optimal regions in this case are given by (b) in (11-7).
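The claim that rule (b) in (11-7) minimizes the TPM can be checked numerically in a simple setting. The sketch below assumes, for illustration only, two univariate normal densities with means $\mu_1 > \mu_2$ and a common standard deviation $\sigma$, so that $R_1$ is a half-line $x \ge c$. Under that assumption, taking logarithms in $f_1(x)/f_2(x) \ge p_2/p_1$ gives the cutoff $c = (\mu_1 + \mu_2)/2 + \sigma^2 \ln(p_2/p_1)/(\mu_1 - \mu_2)$. All names and numbers in the code are hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a univariate normal."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def tpm(cutoff, mean1, mean2, sd, p1):
    """TPM of (11-8) for R1 = {x >= cutoff}, R2 = {x < cutoff}."""
    p2 = 1.0 - p1
    return p1 * normal_cdf(cutoff, mean1, sd) + p2 * (1.0 - normal_cdf(cutoff, mean2, sd))


# Hypothetical populations: N(1, 1) and N(-1, 1) with prior p1 = .8.
mean1, mean2, sd, p1 = 1.0, -1.0, 1.0, 0.8

# Cutoff implied by rule (b) of (11-7) under the equal-variance normal assumption.
analytic_cutoff = 0.5 * (mean1 + mean2) + sd**2 * math.log((1.0 - p1) / p1) / (mean1 - mean2)

# Brute-force search over a grid of candidate cutoffs for comparison.
grid = [i / 1000.0 for i in range(-3000, 3001)]
best_cutoff = min(grid, key=lambda c: tpm(c, mean1, mean2, sd, p1))

print(round(analytic_cutoff, 3), best_cutoff)
```

The grid search and the closed-form cutoff should agree to within the grid spacing; the cutoff moves away from the midpoint toward the less likely population as $p_1$ grows.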


584 Chapter 11 Discrimination and Classification 
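We could also allocate a new observation $\mathbf{x}_0$ to the population with the largest
“posterior” probability $P(\pi_i \mid \mathbf{x}_0)$. By Bayes’s rule, the posterior probabilities are

$$
\begin{aligned}
P(\pi_1 \mid \mathbf{x}_0) &= \frac{P(\pi_1 \text{ occurs and we observe } \mathbf{x}_0)}{P(\text{we observe } \mathbf{x}_0)} \\
&= \frac{P(\text{we observe } \mathbf{x}_0 \mid \pi_1)P(\pi_1)}{P(\text{we observe } \mathbf{x}_0 \mid \pi_1)P(\pi_1) + P(\text{we observe } \mathbf{x}_0 \mid \pi_2)P(\pi_2)} \\
&= \frac{p_1 f_1(\mathbf{x}_0)}{p_1 f_1(\mathbf{x}_0) + p_2 f_2(\mathbf{x}_0)} \\
P(\pi_2 \mid \mathbf{x}_0) &= 1 - P(\pi_1 \mid \mathbf{x}_0) = \frac{p_2 f_2(\mathbf{x}_0)}{p_1 f_1(\mathbf{x}_0) + p_2 f_2(\mathbf{x}_0)}
\end{aligned}
\tag{11-9}
$$

Classifying an observation $\mathbf{x}_0$ as $\pi_1$ when $P(\pi_1 \mid \mathbf{x}_0) > P(\pi_2 \mid \mathbf{x}_0)$ is equivalent to
using the (b) rule for total probability of misclassification in (11-7) because the
denominators in (11-9) are the same. However, computing the probabilities of the
populations $\pi_1$ and $\pi_2$ after observing $\mathbf{x}_0$ (hence the name posterior probabilities) is
frequently useful for purposes of identifying the less clear-cut assignments.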
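As a small numerical illustration of (11-9), the sketch below computes both posterior probabilities from the two density values and the priors. The function name is hypothetical, and the priors used in the call are assumed values, not taken from the text.

```python
def posterior_probabilities(f1_x0, f2_x0, p1, p2):
    """Posterior probabilities P(pi_1 | x0) and P(pi_2 | x0) as in (11-9)."""
    evidence = p1 * f1_x0 + p2 * f2_x0   # common denominator in (11-9)
    post1 = p1 * f1_x0 / evidence
    return post1, 1.0 - post1


# Density values .3 and .4 as in the earlier example; priors are ASSUMED.
post1, post2 = posterior_probabilities(0.3, 0.4, p1=0.8, p2=0.2)
print(round(post1, 3), round(post2, 3))
```

A posterior probability close to one half flags an assignment that is not clear-cut, which the bare allocation label would hide.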


11.3 Classification with Two Multivariate Normal Populations 


Classification procedures based on normal populations predominate in statistical
practice because of their simplicity and reasonably high efficiency across a wide
variety of population models. We now assume that $f_1(\mathbf{x})$ and $f_2(\mathbf{x})$ are multivariate
normal densities, the first with mean vector $\boldsymbol{\mu}_1$ and covariance matrix $\boldsymbol{\Sigma}_1$ and the
second with mean vector $\boldsymbol{\mu}_2$ and covariance matrix $\boldsymbol{\Sigma}_2$.

The special case of equal covariance matrices leads to a particularly simple
linear classification statistic.


Classification of Normal Populations When $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$


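Suppose that the joint densities of $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$ for populations $\pi_1$ and $\pi_2$
are given by

$$
f_i(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_i)\right] \quad \text{for } i = 1, 2
\tag{11-10}
$$

Suppose also that the population parameters $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$ are known. Then, after
cancellation of the terms $(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}$, the minimum ECM regions in (11-6) become

$$
\begin{aligned}
R_1:&\quad \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)\right] \ge \left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right) \\
R_2:&\quad \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)\right] < \left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)
\end{aligned}
\tag{11-11}
$$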




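Given these regions $R_1$ and $R_2$, we can construct the classification rule given in the
following result.


Result 11.2. Let the populations $\pi_1$ and $\pi_2$ be described by multivariate normal
densities of the form (11-10). Then the allocation rule that minimizes the ECM is as
follows:

Allocate $\mathbf{x}_0$ to $\pi_1$ if

$$
(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x}_0 - \frac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) \ge \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right]
\tag{11-12}
$$

Allocate $\mathbf{x}_0$ to $\pi_2$ otherwise.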


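Proof. Since the quantities in (11-11) are nonnegative for all $\mathbf{x}$, we can take their
natural logarithms and preserve the order of the inequalities. Moreover (see
Exercise 11.5),

$$
-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)
= (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)
\tag{11-13}
$$

and, consequently,

$$
\begin{aligned}
R_1:&\quad (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) \ge \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right] \\
R_2:&\quad (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) < \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right]
\end{aligned}
\tag{11-14}
$$

The minimum ECM classification rule follows. ■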


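In most practical situations, the population quantities $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$ are
unknown, so the rule (11-12) must be modified. Wald [31] and Anderson [2] have
suggested replacing the population parameters by their sample counterparts.

Suppose, then, that we have $n_1$ observations of the multivariate random
variable $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$ from $\pi_1$ and $n_2$ measurements of this quantity from $\pi_2$,
with $n_1 + n_2 - 2 \ge p$. Then the respective data matrices are

$$
\underset{(n_1 \times p)}{\mathbf{X}_1} = \begin{bmatrix} \mathbf{x}_{11}' \\ \mathbf{x}_{12}' \\ \vdots \\ \mathbf{x}_{1n_1}' \end{bmatrix},
\qquad
\underset{(n_2 \times p)}{\mathbf{X}_2} = \begin{bmatrix} \mathbf{x}_{21}' \\ \mathbf{x}_{22}' \\ \vdots \\ \mathbf{x}_{2n_2}' \end{bmatrix}
\tag{11-15}
$$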




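From these data matrices, the sample mean vectors and covariance matrices are
determined by

$$
\begin{aligned}
\underset{(p \times 1)}{\bar{\mathbf{x}}_1} &= \frac{1}{n_1}\sum_{j=1}^{n_1} \mathbf{x}_{1j}, &
\underset{(p \times p)}{\mathbf{S}_1} &= \frac{1}{n_1 - 1}\sum_{j=1}^{n_1} (\mathbf{x}_{1j} - \bar{\mathbf{x}}_1)(\mathbf{x}_{1j} - \bar{\mathbf{x}}_1)' \\
\underset{(p \times 1)}{\bar{\mathbf{x}}_2} &= \frac{1}{n_2}\sum_{j=1}^{n_2} \mathbf{x}_{2j}, &
\underset{(p \times p)}{\mathbf{S}_2} &= \frac{1}{n_2 - 1}\sum_{j=1}^{n_2} (\mathbf{x}_{2j} - \bar{\mathbf{x}}_2)(\mathbf{x}_{2j} - \bar{\mathbf{x}}_2)'
\end{aligned}
\tag{11-16}
$$

Since it is assumed that the parent populations have the same covariance matrix $\boldsymbol{\Sigma}$,
the sample covariance matrices $\mathbf{S}_1$ and $\mathbf{S}_2$ are combined (pooled) to derive a single,
unbiased estimate of $\boldsymbol{\Sigma}$ as in (6-21). In particular, the weighted average

$$
\mathbf{S}_{\text{pooled}} = \left[\frac{n_1 - 1}{(n_1 - 1) + (n_2 - 1)}\right]\mathbf{S}_1 + \left[\frac{n_2 - 1}{(n_1 - 1) + (n_2 - 1)}\right]\mathbf{S}_2
\tag{11-17}
$$

is an unbiased estimate of $\boldsymbol{\Sigma}$ if the data matrices $\mathbf{X}_1$ and $\mathbf{X}_2$ contain random
samples from the populations $\pi_1$ and $\pi_2$, respectively.

Substituting $\bar{\mathbf{x}}_1$ for $\boldsymbol{\mu}_1$, $\bar{\mathbf{x}}_2$ for $\boldsymbol{\mu}_2$, and $\mathbf{S}_{\text{pooled}}$ for $\boldsymbol{\Sigma}$ in (11-12) gives the “sample”
classification rule: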


The Estimated Minimum ECM Rule for Two Normal Populations 


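Allocate $\mathbf{x}_0$ to $\pi_1$ if

$$
(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x}_0 - \frac{1}{2}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2) \ge \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right]
\tag{11-18}
$$

Allocate $\mathbf{x}_0$ to $\pi_2$ otherwise.


If, in (11-18),

$$
\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right) = 1
$$

then $\ln(1) = 0$, and the estimated minimum ECM rule for two normal populations
amounts to comparing the scalar variable

$$
\hat{y} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x} = \hat{\mathbf{a}}'\mathbf{x}
\tag{11-19}
$$

evaluated at $\mathbf{x}_0$, with the number

$$
\hat{m} = \frac{1}{2}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2) = \frac{1}{2}(\bar{y}_1 + \bar{y}_2)
\tag{11-20}
$$

where

$$
\bar{y}_1 = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\bar{\mathbf{x}}_1 = \hat{\mathbf{a}}'\bar{\mathbf{x}}_1
$$

and

$$
\bar{y}_2 = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\bar{\mathbf{x}}_2 = \hat{\mathbf{a}}'\bar{\mathbf{x}}_2
$$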




That is, the estimated minimum ECM rule for two normal populations is tantamount
to creating two univariate populations for the $y$ values by taking an appropriate
linear combination of the observations from populations $\pi_1$ and $\pi_2$ and then
assigning a new observation $\mathbf{x}_0$ to $\pi_1$ or $\pi_2$, depending upon whether $\hat{y}_0 = \hat{\mathbf{a}}'\mathbf{x}_0$ falls
to the right or left of the midpoint $\hat{m}$ between the two univariate means $\bar{y}_1$ and $\bar{y}_2$.

Once parameter estimates are inserted for the corresponding unknown population
quantities, there is no assurance that the resulting rule will minimize the
expected cost of misclassification in a particular application. This is because the
optimal rule in (11-12) was derived assuming that the multivariate normal densities
$f_1(\mathbf{x})$ and $f_2(\mathbf{x})$ were known completely. Expression (11-18) is simply an estimate of
the optimal rule. However, it seems reasonable to expect that it should perform well
if the sample sizes are large.³

To summarize, if the data appear to be multivariate normal⁴, the classification
statistic to the left of the inequality in (11-18) can be calculated for each new
observation $\mathbf{x}_0$. These observations are classified by comparing the values of the statistic
with the value of $\ln[(c(1|2)/c(2|1))(p_2/p_1)]$.
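The sample rule can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the two training samples are simulated (assumed) data used only to exercise the code.

```python
import numpy as np


def pooled_covariance(X1, X2):
    """Weighted average (11-17) of the two sample covariance matrices."""
    n1, n2 = X1.shape[0], X2.shape[0]
    S1 = np.cov(X1, rowvar=False)   # divisor n1 - 1, as in (11-16)
    S2 = np.cov(X2, rowvar=False)
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)


def fit_linear_rule(X1, X2):
    """Return the coefficient vector a_hat of (11-19) and midpoint m_hat of (11-20)."""
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    S_pooled = pooled_covariance(X1, X2)
    a_hat = np.linalg.solve(S_pooled, xbar1 - xbar2)
    m_hat = 0.5 * a_hat @ (xbar1 + xbar2)
    return a_hat, m_hat


def allocate(x0, a_hat, m_hat, p1=0.5, p2=0.5, c12=1.0, c21=1.0):
    """Estimated minimum-ECM rule (11-18); returns 1 or 2."""
    w = a_hat @ np.asarray(x0, dtype=float) - m_hat
    return 1 if w >= np.log((c12 / c21) * (p2 / p1)) else 2


# Simulated training data (ASSUMED, for illustration only).
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(30, 2))
X2 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(22, 2))

a_hat, m_hat = fit_linear_rule(X1, X2)
print(allocate(X1.mean(axis=0), a_hat, m_hat), allocate(X2.mean(axis=0), a_hat, m_hat))
```

With equal costs and equal priors the threshold is $\ln(1) = 0$, so each sample mean is always allocated to its own group; unequal costs or priors shift the cutoff away from the midpoint.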


Example 11.3 (Classification with two normal populations: common $\boldsymbol{\Sigma}$ and equal
costs) This example is adapted from a study [4] concerned with the detection of
hemophilia A carriers. (See also Exercise 11.32.)

To construct a procedure for detecting potential hemophilia A carriers, blood
samples were assayed for two groups of women and measurements on the two
variables,

$$
\begin{aligned}
X_1 &= \log_{10}(\text{AHF activity}) \\
X_2 &= \log_{10}(\text{AHF-like antigen})
\end{aligned}
$$

recorded. (“AHF” denotes antihemophilic factor.) The first group of $n_1 = 30$
women was selected from a population of women who did not carry the hemophilia
gene. This group was called the normal group. The second group of $n_2 = 22$ women
was selected from known hemophilia A carriers (daughters of hemophiliacs,
mothers with more than one hemophilic son, and mothers with one hemophilic son
and other hemophilic relatives). This group was called the obligatory carriers. The
pairs of observations $(x_1, x_2)$ for the two groups are plotted in Figure 11.4. Also
shown are estimated contours containing 50% and 95% of the probability for
bivariate normal distributions centered at $\bar{\mathbf{x}}_1$ and $\bar{\mathbf{x}}_2$, respectively. Their common
covariance matrix was taken as the pooled sample covariance matrix $\mathbf{S}_{\text{pooled}}$. In this
example, bivariate normal distributions seem to fit the data fairly well.
The investigators (see [4]) provide the information

$$
\bar{\mathbf{x}}_1 = \begin{bmatrix} -.0065 \\ -.0390 \end{bmatrix}, \qquad
\bar{\mathbf{x}}_2 = \begin{bmatrix} -.2483 \\ .0262 \end{bmatrix}
$$


³ As the sample sizes increase, $\bar{\mathbf{x}}_1$, $\bar{\mathbf{x}}_2$, and $\mathbf{S}_{\text{pooled}}$ become, with probability approaching 1,
indistinguishable from $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$, respectively [see (4-26) and (4-27)].

⁴ At the very least, the marginal frequency distributions of the observations on each variable can be
checked for normality. This must be done for the samples from both populations. Often, some variables
must be transformed in order to make them more “normal looking.” (See Sections 4.6 and 4.8.)




[Figure 11.4: scatter plot with horizontal axis $x_1 = \log_{10}(\text{AHF activity})$ and vertical axis $x_2 = \log_{10}(\text{AHF-like antigen})$; separate plotting symbols mark the normals and the obligatory carriers.]

Figure 11.4 Scatter plots of [$\log_{10}$(AHF activity), $\log_{10}$(AHF-like antigen)] for the
normal group and obligatory hemophilia A carriers.


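and

$$
\mathbf{S}_{\text{pooled}}^{-1} = \begin{bmatrix} 131.158 & -90.423 \\ -90.423 & 108.147 \end{bmatrix}
$$

Therefore, the equal costs and equal priors discriminant function [see (11-19)] is

$$
\begin{aligned}
\hat{y} = \hat{\mathbf{a}}'\mathbf{x} &= [\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2]'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x} \\
&= \begin{bmatrix} .2418 & -.0652 \end{bmatrix}
\begin{bmatrix} 131.158 & -90.423 \\ -90.423 & 108.147 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\
&= 37.61x_1 - 28.92x_2
\end{aligned}
$$

Moreover,

$$
\begin{aligned}
\bar{y}_1 &= \hat{\mathbf{a}}'\bar{\mathbf{x}}_1 = \begin{bmatrix} 37.61 & -28.92 \end{bmatrix}\begin{bmatrix} -.0065 \\ -.0390 \end{bmatrix} = .88 \\
\bar{y}_2 &= \hat{\mathbf{a}}'\bar{\mathbf{x}}_2 = \begin{bmatrix} 37.61 & -28.92 \end{bmatrix}\begin{bmatrix} -.2483 \\ .0262 \end{bmatrix} = -10.10
\end{aligned}
$$

and the midpoint between these means [see (11-20)] is

$$
\hat{m} = \frac{1}{2}(\bar{y}_1 + \bar{y}_2) = \frac{1}{2}(.88 - 10.10) = -4.61
$$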
Measurements of AHF activity and AHF-like antigen on a woman who may be
a hemophilia A carrier give $x_1 = -.210$ and $x_2 = -.044$. Should this woman be
classified as $\pi_1$ (normal) or $\pi_2$ (obligatory carrier)?
Using (11-18) with equal costs and equal priors so that $\ln(1) = 0$, we obtain

Allocate $\mathbf{x}_0$ to $\pi_1$ if $\hat{y}_0 = \hat{\mathbf{a}}'\mathbf{x}_0 \ge \hat{m} = -4.61$

Allocate $\mathbf{x}_0$ to $\pi_2$ if $\hat{y}_0 = \hat{\mathbf{a}}'\mathbf{x}_0 < \hat{m} = -4.61$




where $\mathbf{x}_0' = [-.210, -.044]$. Since

$$
\hat{y}_0 = \hat{\mathbf{a}}'\mathbf{x}_0 = \begin{bmatrix} 37.61 & -28.92 \end{bmatrix}\begin{bmatrix} -.210 \\ -.044 \end{bmatrix} = -6.62 < -4.61
$$

we classify the woman as $\pi_2$, an obligatory carrier. The new observation is indicated
by a star in Figure 11.4. We see that it falls within the estimated .50 probability
contour of population $\pi_2$ and about on the estimated .95 probability contour of
population $\pi_1$. Thus, the classification is not clear cut.

Suppose now that the prior probabilities of group membership are known. For
example, suppose the blood yielding the foregoing $x_1$ and $x_2$ measurements is drawn
from the maternal first cousin of a hemophiliac. Then the genetic chance of being a
hemophilia A carrier in this case is .25. Consequently, the prior probabilities of
group membership are $p_1 = .75$ and $p_2 = .25$. Assuming, somewhat unrealistically,
that the costs of misclassification are equal, so that $c(1|2) = c(2|1)$, and using the
classification statistic

$$
\hat{w} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x}_0 - \frac{1}{2}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2)
$$

or $\hat{w} = \hat{\mathbf{a}}'\mathbf{x}_0 - \hat{m}$ with $\mathbf{x}_0' = [-.210, -.044]$, $\hat{m} = -4.61$, and $\hat{\mathbf{a}}'\mathbf{x}_0 = -6.62$, we
have

$$
\hat{w} = -6.62 - (-4.61) = -2.01
$$

Applying (11-18), we see that

$$
\hat{w} = -2.01 < \ln\left[\frac{p_2}{p_1}\right] = \ln\left[\frac{.25}{.75}\right] = -1.10
$$

and we classify the woman as $\pi_2$, an obligatory carrier. ■
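The arithmetic of Example 11.3 can be checked with a short NumPy sketch. It uses only the summary statistics quoted in the example; the variable names are illustrative. Because the code carries full precision instead of the rounded coefficients 37.61 and $-28.92$, its results agree with the printed values only to within rounding.

```python
import numpy as np

# Summary statistics quoted in Example 11.3.
xbar1 = np.array([-0.0065, -0.0390])   # normal group
xbar2 = np.array([-0.2483, 0.0262])    # obligatory carriers
S_pooled_inv = np.array([[131.158, -90.423],
                         [-90.423, 108.147]])
x0 = np.array([-0.210, -0.044])        # new observation

a_hat = S_pooled_inv @ (xbar1 - xbar2)   # discriminant coefficients (11-19)
m_hat = 0.5 * a_hat @ (xbar1 + xbar2)    # midpoint (11-20)
y0 = a_hat @ x0
w = y0 - m_hat                           # left side of (11-18)

# Equal costs throughout; only the priors differ between the two cases.
group_equal_priors = 1 if w >= np.log(0.5 / 0.5) else 2
group_cousin_priors = 1 if w >= np.log(0.25 / 0.75) else 2

print(np.round(a_hat, 2), round(float(m_hat), 2), round(float(y0), 2))
print(group_equal_priors, group_cousin_priors)
```

Both sets of priors lead to the same allocation here, because the statistic lies below both cutoffs.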


Scaling 


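The coefficient vector $\hat{\mathbf{a}} = \mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$ is unique only up to a multiplicative
constant, so, for $c \ne 0$, any vector $c\hat{\mathbf{a}}$ will also serve as discriminant coefficients.

The vector $\hat{\mathbf{a}}$ is frequently “scaled” or “normalized” to ease the interpretation of
its elements. Two of the most commonly employed normalizations are

1. Set

$$
\hat{\mathbf{a}}^* = \frac{\hat{\mathbf{a}}}{\sqrt{\hat{\mathbf{a}}'\hat{\mathbf{a}}}}
\tag{11-21}
$$

so that $\hat{\mathbf{a}}^*$ has unit length.

2. Set

$$
\hat{\mathbf{a}}^* = \frac{\hat{\mathbf{a}}}{\hat{a}_1}
\tag{11-22}
$$

so that the first element of the new coefficient vector $\hat{\mathbf{a}}^*$ is 1.

In both cases, $\hat{\mathbf{a}}^*$ is of the form $c\hat{\mathbf{a}}$. For normalization (1), $c = (\hat{\mathbf{a}}'\hat{\mathbf{a}})^{-1/2}$, and
for (2), $c = \hat{a}_1^{-1}$.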
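The two normalizations can be illustrated with the rounded hemophilia coefficients from Example 11.3. The sketch below is illustrative only; the helper `allocate` is a hypothetical name, and the midpoint is rescaled by the same constant $c$ as the coefficients.

```python
import numpy as np

a_hat = np.array([37.61, -28.92])   # rounded coefficients from Example 11.3
m_hat = -4.61                       # midpoint from Example 11.3
x0 = np.array([-0.210, -0.044])     # new observation from Example 11.3

c_unit = 1.0 / np.sqrt(a_hat @ a_hat)   # normalization (11-21)
c_first = 1.0 / a_hat[0]                # normalization (11-22)
a_unit = c_unit * a_hat
a_first = c_first * a_hat


def allocate(a, m, x):
    """Equal costs and equal priors: population 1 if a'x >= m, else 2."""
    return 1 if a @ x >= m else 2


# Scaling coefficients and midpoint by the same positive c keeps the allocation.
labels = [allocate(c * a_hat, c * m_hat, x0) for c in (1.0, c_unit, c_first)]
print(np.round(a_unit, 4), np.round(a_first, 4), labels)
```

Both constants are positive here. A negative $c$ would reverse the direction of the inequality, so the roles of “right” and “left” of the midpoint would swap.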




The magnitudes of $\hat{a}_1^*, \hat{a}_2^*, \ldots, \hat{a}_p^*$ in (11-21) all lie in the interval $[-1, 1]$. In
(11-22), $\hat{a}_1^* = 1$ and $\hat{a}_2^*, \ldots, \hat{a}_p^*$ are expressed as multiples of $\hat{a}_1^*$. Constraining the $\hat{a}_i^*$
to the interval $[-1, 1]$ usually facilitates a visual comparison of the coefficients.
Similarly, expressing the coefficients as multiples of $\hat{a}_1^*$ allows one to readily assess the
relative importance (vis-à-vis $X_1$) of variables $X_2, \ldots, X_p$ as discriminators.

Normalizing the $\hat{a}_i$’s is recommended only if the $X$ variables have been
standardized. If this is not the case, a great deal of care must be exercised in interpreting
the results.


Fisher’s Approach to Classification with Two Populations 


Fisher [10] actually arrived at the linear classification statistic (11-19) using an
entirely different argument. Fisher’s idea was to transform the multivariate
observations $\mathbf{x}$ to univariate observations $y$ such that the $y$’s derived from populations $\pi_1$ and
$\pi_2$ were separated as much as possible. Fisher suggested taking linear combinations
of $\mathbf{x}$ to create $y$’s because they are simple enough functions of the $\mathbf{x}$ to be handled
easily. Fisher’s approach does not assume that the populations are normal. It does,
however, implicitly assume that the population covariance matrices are equal,
because a pooled estimate of the common covariance matrix is used.

A fixed linear combination of the $\mathbf{x}$’s takes the values $y_{11}, y_{12}, \ldots, y_{1n_1}$ for the
observations from the first population and the values $y_{21}, y_{22}, \ldots, y_{2n_2}$ for the
observations from the second population. The separation of these two sets of univariate
$y$’s is assessed in terms of the difference between $\bar{y}_1$ and $\bar{y}_2$ expressed in standard
deviation units. That is,

$$
\text{separation} = \frac{|\bar{y}_1 - \bar{y}_2|}{s_y}, \quad \text{where} \quad
s_y^2 = \frac{\sum_{j=1}^{n_1}(y_{1j} - \bar{y}_1)^2 + \sum_{j=1}^{n_2}(y_{2j} - \bar{y}_2)^2}{n_1 + n_2 - 2}
$$

is the pooled estimate of the variance. The objective is to select the linear
combination of the $\mathbf{x}$ to achieve maximum separation of the sample means $\bar{y}_1$ and $\bar{y}_2$.


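Result 11.3. The linear combination $\hat{y} = \hat{\mathbf{a}}'\mathbf{x} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x}$ maximizes the
ratio

$$
\frac{\left(\text{squared distance between sample means of } y\right)}{\left(\text{sample variance of } y\right)}
= \frac{(\bar{y}_1 - \bar{y}_2)^2}{s_y^2}
= \frac{(\hat{\mathbf{a}}'\bar{\mathbf{x}}_1 - \hat{\mathbf{a}}'\bar{\mathbf{x}}_2)^2}{\hat{\mathbf{a}}'\mathbf{S}_{\text{pooled}}\hat{\mathbf{a}}}
= \frac{(\hat{\mathbf{a}}'\mathbf{d})^2}{\hat{\mathbf{a}}'\mathbf{S}_{\text{pooled}}\hat{\mathbf{a}}}
\tag{11-23}
$$

over all possible coefficient vectors $\hat{\mathbf{a}}$ where $\mathbf{d} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$. The maximum of the
ratio (11-23) is $D^2 = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$.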




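Proof. The maximum of the ratio in (11-23) is given by applying (2-50) directly.
Thus, setting $\mathbf{d} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$, we have

$$
\max_{\hat{\mathbf{a}}} \frac{(\hat{\mathbf{a}}'\mathbf{d})^2}{\hat{\mathbf{a}}'\mathbf{S}_{\text{pooled}}\hat{\mathbf{a}}}
= \mathbf{d}'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{d}
= (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) = D^2
$$

where $D^2$ is the sample squared distance between the two means. ■


Note that $s_y^2$ in (11-23) may be calculated as

$$
s_y^2 = \frac{\sum_{j=1}^{n_1}(y_{1j} - \bar{y}_1)^2 + \sum_{j=1}^{n_2}(y_{2j} - \bar{y}_2)^2}{n_1 + n_2 - 2}
\tag{11-24}
$$

with $y_{1j} = \hat{\mathbf{a}}'\mathbf{x}_{1j}$ and $y_{2j} = \hat{\mathbf{a}}'\mathbf{x}_{2j}$.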


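Example 11.4 (Fisher’s linear discriminant for the hemophilia data) Consider the
detection of hemophilia A carriers introduced in Example 11.3. Recall that the equal
costs and equal priors linear discriminant function was

$$
\hat{y} = \hat{\mathbf{a}}'\mathbf{x} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x} = 37.61x_1 - 28.92x_2
$$

This linear discriminant function is Fisher’s linear function, which maximally
separates the two populations, and the maximum separation in the samples is

$$
\begin{aligned}
D^2 &= (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) \\
&= \begin{bmatrix} .2418 & -.0652 \end{bmatrix}
\begin{bmatrix} 131.158 & -90.423 \\ -90.423 & 108.147 \end{bmatrix}
\begin{bmatrix} .2418 \\ -.0652 \end{bmatrix} \\
&= 10.98
\end{aligned}
$$
■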
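The maximizing property in Result 11.3 can be checked numerically for the hemophilia summary statistics. The sketch below is illustrative: it compares the ratio (11-23) at Fisher’s coefficient vector with the ratio at 1,000 randomly drawn directions (the random trial directions are an assumption of this check, not part of the example).

```python
import numpy as np

S_pooled_inv = np.array([[131.158, -90.423],
                         [-90.423, 108.147]])
S_pooled = np.linalg.inv(S_pooled_inv)
d = np.array([0.2418, -0.0652])          # xbar1 - xbar2 from Example 11.4


def separation_ratio(a):
    """Ratio (11-23): squared mean difference over pooled variance of y = a'x."""
    return (a @ d) ** 2 / (a @ S_pooled @ a)


a_fisher = S_pooled_inv @ d              # Fisher's coefficient vector
D2 = d @ S_pooled_inv @ d                # claimed maximum of the ratio

rng = np.random.default_rng(0)
trial_ratios = [separation_ratio(rng.normal(size=2)) for _ in range(1000)]
print(round(float(D2), 2), round(float(separation_ratio(a_fisher)), 2))
print(max(trial_ratios) <= D2 + 1e-9)
```

No trial direction exceeds $D^2$, and the ratio is unchanged when a direction is multiplied by a nonzero constant, which is why only the direction of $\hat{\mathbf{a}}$ matters.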


Fisher’s solution to the separation problem can also be used to classify new 
observations. 


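An Allocation Rule Based on Fisher’s Discriminant Function⁵


Allocate $\mathbf{x}_0$ to $\pi_1$ if

$$
\hat{y}_0 = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x}_0 \ge \hat{m} = \frac{1}{2}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2)
\quad \text{or} \quad \hat{y}_0 - \hat{m} \ge 0
\tag{11-25}
$$

Allocate $\mathbf{x}_0$ to $\pi_2$ if

$$
\hat{y}_0 < \hat{m} \quad \text{or} \quad \hat{y}_0 - \hat{m} < 0
$$


⁵ We must have $(n_1 + n_2 - 2) \ge p$; otherwise $\mathbf{S}_{\text{pooled}}$ is singular, and the usual inverse, $\mathbf{S}_{\text{pooled}}^{-1}$, does
not exist.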




Figure 11.5 A pictorial representation of Fisher’s procedure for two populations 
with p = 2. 


The procedure (11-23) is illustrated, schematically, for $p = 2$ in Figure 11.5. All
points in the scatter plots are projected onto a line in the direction $\hat{\mathbf{a}}$, and this
direction is varied until the samples are maximally separated.

Fisher’s linear discriminant function in (11-25) was developed under the
assumption that the two populations, whatever their form, have a common covariance
matrix. Consequently, it may not be surprising that Fisher’s method corresponds to
a particular case of the minimum expected-cost-of-misclassification rule. The first
term, $\hat{y} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x}$, in the classification rule (11-18) is the linear function
obtained by Fisher that maximizes the univariate “between” samples variability
relative to the “within” samples variability. [See (11-23).] The entire expression

$$
\begin{aligned}
\hat{w} &= (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x} - \frac{1}{2}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2) \\
&= (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\left[\mathbf{x} - \frac{1}{2}(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2)\right]
\end{aligned}
\tag{11-26}
$$

is frequently called Anderson’s classification function (statistic). Once again, if
$[(c(1|2)/c(2|1))(p_2/p_1)] = 1$, so that $\ln[(c(1|2)/c(2|1))(p_2/p_1)] = 0$, Rule
(11-18) is comparable to Rule (11-25), based on Fisher’s linear discriminant
function. Thus, provided that the two normal populations have the same covariance
matrix, Fisher’s classification rule is equivalent to the minimum ECM rule with equal
prior probabilities and equal costs of misclassification.


Is Classification a Good Idea? 


For two populations, the maximum relative separation that can be obtained by
considering linear combinations of the multivariate observations is equal to the
distance $D^2$. This is convenient because $D^2$ can be used, in certain situations, to test
whether the population means $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ differ significantly. Consequently, a test
for differences in mean vectors can be viewed as a test for the “significance” of the
separation that can be achieved.




Suppose the populations $\pi_1$ and $\pi_2$ are multivariate normal with a common
covariance matrix $\boldsymbol{\Sigma}$. Then, as in Section 6.3, a test of $H_0\!: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ versus $H_1\!: \boldsymbol{\mu}_1 \ne \boldsymbol{\mu}_2$
is accomplished by referring

$$
\left(\frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)p}\right)\left(\frac{n_1 n_2}{n_1 + n_2}\right)D^2
$$

to an $F$-distribution with $\nu_1 = p$ and $\nu_2 = n_1 + n_2 - p - 1$ d.f. If $H_0$ is rejected,
we can conclude that the separation between the two populations $\pi_1$ and $\pi_2$ is
significant.
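As a worked illustration, the test statistic can be evaluated for the hemophilia data, using $n_1 = 30$, $n_2 = 22$, $p = 2$ from Example 11.3 and $D^2 = 10.98$ from Example 11.4. The function name below is hypothetical, and applying the test to these data is an illustration added here, not a calculation carried out in the text.

```python
def separation_f_statistic(D2, n1, n2, p):
    """Scaled D^2 referred to an F distribution with (p, n1 + n2 - p - 1) d.f."""
    nu1 = p
    nu2 = n1 + n2 - p - 1
    F = (nu2 / ((n1 + n2 - 2) * p)) * (n1 * n2 / (n1 + n2)) * D2
    return F, nu1, nu2


F, nu1, nu2 = separation_f_statistic(10.98, n1=30, n2=22, p=2)
print(round(F, 2), nu1, nu2)
```

The statistic would then be compared with an upper percentile of the $F_{\nu_1, \nu_2}$ distribution, taken from a table or from a routine such as `scipy.stats.f.ppf` if SciPy is available.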


Comment. Significant separation does not necessarily imply good classifica- 
tion. As we shall see in Section 11.4, the efficacy of a classification procedure can be 
evaluated independently of any test of separation. By contrast, if the separation is 
not significant, the search for a useful classification rule will probably prove 
fruitless. 


Classification of Normal Populations When $\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2$


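As might be expected, the classification rules are more complicated when the
population covariance matrices are unequal.

Consider the multivariate normal densities in (11-10) with $\boldsymbol{\Sigma}_i$, $i = 1, 2$, replacing
$\boldsymbol{\Sigma}$. Thus, the covariance matrices, as well as the mean vectors, are different from
one another for the two populations. As we have seen, the regions of minimum
ECM and minimum total probability of misclassification (TPM) depend on the
ratio of the densities, $f_1(\mathbf{x})/f_2(\mathbf{x})$, or, equivalently, the natural logarithm of the
density ratio, $\ln[f_1(\mathbf{x})/f_2(\mathbf{x})] = \ln[f_1(\mathbf{x})] - \ln[f_2(\mathbf{x})]$. When the multivariate normal
densities have different covariance structures, the terms in the density ratio involving
$|\boldsymbol{\Sigma}_i|^{1/2}$ do not cancel as they do when $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$. Moreover, the quadratic forms in
the exponents of $f_1(\mathbf{x})$ and $f_2(\mathbf{x})$ do not combine to give the rather simple result in
(11-13).

Substituting multivariate normal densities with different covariance matrices
into (11-6) gives, after taking natural logarithms and simplifying (see Exercise
11.15), the classification regions

$$
\begin{aligned}
R_1:&\quad -\frac{1}{2}\mathbf{x}'(\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1})\mathbf{x} + (\boldsymbol{\mu}_1'\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\mu}_2'\boldsymbol{\Sigma}_2^{-1})\mathbf{x} - k \ge \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right] \\
R_2:&\quad -\frac{1}{2}\mathbf{x}'(\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1})\mathbf{x} + (\boldsymbol{\mu}_1'\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\mu}_2'\boldsymbol{\Sigma}_2^{-1})\mathbf{x} - k < \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right]
\end{aligned}
\tag{11-27}
$$

where

$$
k = \frac{1}{2}\ln\left(\frac{|\boldsymbol{\Sigma}_1|}{|\boldsymbol{\Sigma}_2|}\right) + \frac{1}{2}\left(\boldsymbol{\mu}_1'\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2'\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\mu}_2\right)
\tag{11-28}
$$

The classification regions are defined by quadratic functions of $\mathbf{x}$. When $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$,
the quadratic term, $-\frac{1}{2}\mathbf{x}'(\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1})\mathbf{x}$, disappears, and the regions defined by
(11-27) reduce to those defined by (11-14).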




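The classification rule for general multivariate normal populations follows
directly from (11-27).


Result 11.4. Let the populations $\pi_1$ and $\pi_2$ be described by multivariate normal
densities with mean vectors and covariance matrices $\boldsymbol{\mu}_1$, $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\mu}_2$, $\boldsymbol{\Sigma}_2$,
respectively. The allocation rule that minimizes the expected cost of misclassification is
given by

Allocate $\mathbf{x}_0$ to $\pi_1$ if

$$
-\frac{1}{2}\mathbf{x}_0'(\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1})\mathbf{x}_0 + (\boldsymbol{\mu}_1'\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\mu}_2'\boldsymbol{\Sigma}_2^{-1})\mathbf{x}_0 - k \ge \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right]
$$

Allocate $\mathbf{x}_0$ to $\pi_2$ otherwise.

Here $k$ is set out in (11-28). ■


In practice, the classification rule in Result 11.4 is implemented by substituting
the sample quantities $\bar{\mathbf{x}}_1$, $\bar{\mathbf{x}}_2$, $\mathbf{S}_1$, and $\mathbf{S}_2$ (see (11-16)) for $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, $\boldsymbol{\Sigma}_1$, and $\boldsymbol{\Sigma}_2$,
respectively.⁶


Quadratic Classification Rule
(Normal Populations with Unequal Covariance Matrices)


Allocate $\mathbf{x}_0$ to $\pi_1$ if

$$
-\frac{1}{2}\mathbf{x}_0'(\mathbf{S}_1^{-1} - \mathbf{S}_2^{-1})\mathbf{x}_0 + (\bar{\mathbf{x}}_1'\mathbf{S}_1^{-1} - \bar{\mathbf{x}}_2'\mathbf{S}_2^{-1})\mathbf{x}_0 - k \ge \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right]
\tag{11-29}
$$

Allocate $\mathbf{x}_0$ to $\pi_2$ otherwise.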
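The quadratic rule (11-29) can be sketched with NumPy as below. The function names are hypothetical, and the mean vectors, covariance matrices, and new observation are assumed values chosen only to exercise the code. The score on the left of (11-29) is the difference of the two fitted normal log densities, which gives a convenient check.

```python
import numpy as np


def quadratic_score(x0, xbar1, S1, xbar2, S2):
    """Left-hand side of (11-29), with k computed as in (11-28)."""
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    k = (0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2))
         + 0.5 * (xbar1 @ S1_inv @ xbar1 - xbar2 @ S2_inv @ xbar2))
    quadratic_term = -0.5 * x0 @ (S1_inv - S2_inv) @ x0
    linear_term = (xbar1 @ S1_inv - xbar2 @ S2_inv) @ x0
    return quadratic_term + linear_term - k


def allocate_quadratic(x0, xbar1, S1, xbar2, S2, p1=0.5, p2=0.5, c12=1.0, c21=1.0):
    """Return 1 or 2 according to the quadratic classification rule."""
    score = quadratic_score(x0, xbar1, S1, xbar2, S2)
    return 1 if score >= np.log((c12 / c21) * (p2 / p1)) else 2


# ASSUMED sample quantities, for illustration only.
xbar1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
xbar2, S2 = np.array([2.0, 1.0]), np.array([[3.0, -0.5], [-0.5, 1.0]])
x0 = np.array([0.5, 0.2])

score = quadratic_score(x0, xbar1, S1, xbar2, S2)
print(round(float(score), 4), allocate_quadratic(x0, xbar1, S1, xbar2, S2))
```

Passing the same covariance matrix for both groups makes the quadratic term vanish, and the score reduces to the linear statistic of (11-18).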


Classification with quadratic functions is rather awkward in more than two
dimensions and can lead to some strange results. This is particularly true when the
data are not (essentially) multivariate normal.

Figure 11.6(a) shows the equal costs and equal priors rule based on the
idealized case of two normal distributions with different variances. This quadratic rule
leads to a region $R_1$ consisting of two disjoint sets of points.

In many applications, the lower tail for the $\pi_1$ distribution will be smaller
than that prescribed by a normal distribution. Then, as shown in Figure 11.6(b),
the lower part of the region $R_1$, produced by the quadratic procedure, does not
line up well with the population distributions and can lead to large error rates.
A serious weakness of the quadratic rule is that it is sensitive to departures from
normality.


⁶ The inequalities $n_1 > p$ and $n_2 > p$ must both hold for $\mathbf{S}_1^{-1}$ and $\mathbf{S}_2^{-1}$ to exist. These quantities are
used in place of $\boldsymbol{\Sigma}_1^{-1}$ and $\boldsymbol{\Sigma}_2^{-1}$, respectively, in the sample analog (11-29).




Figure 11.6 Quadratic rules for (a) two normal distributions with unequal variances
and (b) two distributions, one of which is nonnormal—rule not appropriate. 


If the data are not multivariate normal, two options are available. First, the non- 
normal data can be transformed to data more nearly normal, and a test for the 
equality of covariance matrices can be conducted (see Section 6.6) to see whether 
the linear rule (11-18) or the quadratic rule (11-29) is appropriate. Transformations 
are discussed in Chapter 4. (The usual tests for covariance homogeneity are greatly 
affected by nonnormality. The conversion of nonnormal data to normal data must 
be done before this testing is carried out.) 

Second, we can use a linear (or quadratic) rule without worrying about the form 
of the parent populations and hope that it will work reasonably well. Studies (see 
[22] and [23]) have shown, however, that there are nonnormal cases where a linear 
classification function performs poorly, even though the population covariance ma- 
trices are the same. The moral is to always check the performance of any classifica- 
tion procedure. At the very least, this should be done with the data sets used to build 
the classifier. Ideally, there will be enough data available to provide for “training” 
samples and “validation” samples. The training samples can be used to develop 
the classification function, and the validation samples can be used to evaluate its 
performance. 




11.4 Evaluating Classification Functions 


One important way of judging the performance of any classification procedure is to
calculate its “error rates,” or misclassification probabilities. When the forms of the
parent populations are known completely, misclassification probabilities can be
calculated with relative ease, as we show in Example 11.5. Because parent populations
are rarely known, we shall concentrate on the error rates associated with the sample
classification function. Once this classification function is constructed, a measure of
its performance in future samples is of interest.

From (11-8), the total probability of misclassification is

$$
\text{TPM} = p_1 \int_{R_2} f_1(\mathbf{x})\,d\mathbf{x} + p_2 \int_{R_1} f_2(\mathbf{x})\,d\mathbf{x}
$$

The smallest value of this quantity, obtained by a judicious choice of $R_1$ and $R_2$, is
called the optimum error rate (OER).

$$
\text{Optimum error rate (OER)} = p_1 \int_{R_2} f_1(\mathbf{x})\,d\mathbf{x} + p_2 \int_{R_1} f_2(\mathbf{x})\,d\mathbf{x}
\tag{11-30}
$$

where $R_1$ and $R_2$ are determined by case (b) in (11-7).
Thus, the OER is the error rate for the minimum TPM classification rule.


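Example 11.5 (Calculating misclassification probabilities) Let us derive an
expression for the optimum error rate when $p_1 = p_2 = \frac{1}{2}$ and $f_1(\mathbf{x})$ and $f_2(\mathbf{x})$ are the
multivariate normal densities in (11-10).

Now, the minimum ECM and minimum TPM classification rules coincide when
$c(1|2) = c(2|1)$. Because the prior probabilities are also equal, the minimum
TPM classification regions are defined for normal populations by (11-12), with

$$
\ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right] = 0
$$

so that

$$
\begin{aligned}
R_1:&\quad (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) \ge 0 \\
R_2:&\quad (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) < 0
\end{aligned}
$$

These sets can be expressed in terms of $y = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x} = \mathbf{a}'\mathbf{x}$ as

$$
\begin{aligned}
R_1(y):&\quad y \ge \frac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) \\
R_2(y):&\quad y < \frac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)
\end{aligned}
$$

But $Y$ is a linear combination of normal random variables, so the probability
densities of $Y$, $f_1(y)$ and $f_2(y)$, are univariate normal (see Result 4.2) with means and a
variance given by

$$
\begin{aligned}
\mu_{1Y} &= \mathbf{a}'\boldsymbol{\mu}_1 = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 \\
\mu_{2Y} &= \mathbf{a}'\boldsymbol{\mu}_2 = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 \\
\sigma_Y^2 &= \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = \Delta^2
\end{aligned}
$$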




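[Figure 11.7: the univariate normal densities $f_2(y) = N(\mu_{2Y}, \Delta^2)$ and $f_1(y) = N(\mu_{1Y}, \Delta^2)$ plotted on a common $y$ axis, with the means $\mu_{2Y}$ and $\mu_{1Y}$ marked.]

Figure 11.7 The misclassification probabilities based on Y.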


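Now,

$$
\text{TPM} = \tfrac{1}{2}P[\text{misclassifying a } \pi_1 \text{ observation as } \pi_2]
+ \tfrac{1}{2}P[\text{misclassifying a } \pi_2 \text{ observation as } \pi_1]
$$

But, as shown in Figure 11.7,

$$
\begin{aligned}
P[\text{misclassifying a } \pi_1 \text{ observation as } \pi_2] &= P(2|1) \\
&= P\left[Y < \tfrac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)\right] \\
&= P\left(\frac{Y - \mu_{1Y}}{\sigma_Y} < \frac{\tfrac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1}{\Delta}\right) \\
&= P\left(Z < \frac{-\tfrac{1}{2}\Delta^2}{\Delta}\right) = \Phi\left(\frac{-\Delta}{2}\right)
\end{aligned}
$$

where $\Phi(\cdot)$ is the cumulative distribution function of a standard normal random
variable. Similarly,

$$
\begin{aligned}
P[\text{misclassifying a } \pi_2 \text{ observation as } \pi_1] &= P(1|2) = P\left[Y \ge \tfrac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)\right] \\
&= P\left(Z \ge \frac{\Delta}{2}\right) = 1 - \Phi\left(\frac{\Delta}{2}\right) = \Phi\left(\frac{-\Delta}{2}\right)
\end{aligned}
$$

Therefore, the optimum error rate is

$$
\text{OER} = \text{minimum TPM} = \frac{1}{2}\Phi\left(\frac{-\Delta}{2}\right) + \frac{1}{2}\Phi\left(\frac{-\Delta}{2}\right) = \Phi\left(\frac{-\Delta}{2}\right)
\tag{11-31}
$$

If, for example, $\Delta^2 = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = 2.56$, then $\Delta = \sqrt{2.56} = 1.6$, and,
using Table 1 in the appendix, we obtain

$$
\text{Minimum TPM} = \Phi\left(\frac{-1.6}{2}\right) = \Phi(-.8) = .2119
$$

The optimal classification rule here will incorrectly allocate about 21% of the items
to one population or the other. ■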
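The closing calculation of Example 11.5 can be reproduced without a normal table. The sketch below evaluates $\Phi$ through the error function in Python’s standard library; the function names are illustrative.

```python
import math


def standard_normal_cdf(z):
    """Phi(z), the standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def optimum_error_rate(delta_squared):
    """OER = Phi(-Delta / 2) from (11-31): equal priors, equal costs, common Sigma."""
    delta = math.sqrt(delta_squared)
    return standard_normal_cdf(-delta / 2.0)


oer = optimum_error_rate(2.56)
print(round(oer, 4))
```

Because $\Phi(-\Delta/2)$ decreases as $\Delta$ grows, populations whose means are farther apart in the metric of $\boldsymbol{\Sigma}$ have a smaller optimum error rate, and $\Delta = 0$ gives an error rate of one half.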


Example 11.5 illustrates how the optimum error rate can be calculated when the
population density functions are known. If, as is usually the case, certain population
parameters appearing in allocation rules must be estimated from the sample, then
the evaluation of error rates is not straightforward.

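The performance of sample classification functions can, in principle, be
evaluated by calculating the actual error rate (AER),

$$
\text{AER} = p_1 \int_{\hat{R}_2} f_1(\mathbf{x})\,d\mathbf{x} + p_2 \int_{\hat{R}_1} f_2(\mathbf{x})\,d\mathbf{x}
\tag{11-32}
$$

where $\hat{R}_1$ and $\hat{R}_2$ represent the classification regions determined by samples of size
$n_1$ and $n_2$, respectively. For example, if the classification function in (11-18) is
employed, the regions $\hat{R}_1$ and $\hat{R}_2$ are defined by the set of $\mathbf{x}$’s for which the following
inequalities are satisfied.

$$
\begin{aligned}
\hat{R}_1:&\quad (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x} - \frac{1}{2}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2) \ge \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right] \\
\hat{R}_2:&\quad (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x} - \frac{1}{2}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2) < \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right]
\end{aligned}
$$

The AER indicates how the sample classification function will perform in future
samples. Like the optimal error rate, it cannot, in general, be calculated, because it
depends on the unknown density functions $f_1(\mathbf{x})$ and $f_2(\mathbf{x})$. However, an estimate of
a quantity related to the actual error rate can be calculated, and this estimate will be
discussed shortly.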

There is a measure of performance that does not depend on the form of the
parent populations and that can be calculated for any classification procedure. This
measure, called the apparent error rate (APER), is defined as the fraction of
observations in the training sample that are misclassified by the sample classification function.

The apparent error rate can be easily calculated from the confusion matrix,
which shows actual versus predicted group membership. For $n_1$ observations from
$\pi_1$ and $n_2$ observations from $\pi_2$, the confusion matrix has the form

                               Predicted membership
                          π1                        π2
    Actual        π1      n_1C                      n_1M = n_1 − n_1C      n_1
    membership    π2      n_2M = n_2 − n_2C         n_2C                   n_2
                                                                      (11-33)

where

    n_1C = number of π1 items correctly classified as π1 items
    n_1M = number of π1 items misclassified as π2 items
    n_2C = number of π2 items correctly classified
    n_2M = number of π2 items misclassified

The apparent error rate is then

$$
\text{APER} = \frac{n_{1M} + n_{2M}}{n_1 + n_2}
\tag{11-34}
$$

which is recognized as the proportion of items in the training set that are misclassified.




Example 11.6 (Calculating the apparent error rate) Consider the classification
regions $R_1$ and $R_2$ shown in Figure 11.1 for the riding-mower data. In this case,
observations northeast of the solid line are classified as $\pi_1$, mower owners; observations
southwest of the solid line are classified as $\pi_2$, nonowners. Notice that some
observations are misclassified. The confusion matrix is

                                             Predicted membership
                                             π1: riding-mower owners    π2: nonowners
    Actual        π1: riding-mower owners    n_1C = 10                  n_1M = 2           n_1 = 12
    membership    π2: nonowners              n_2M = 2                   n_2C = 10          n_2 = 12

The apparent error rate, expressed as a percentage, is

$$
\text{APER} = \left(\frac{2 + 2}{12 + 12}\right)100\% = \left(\frac{4}{24}\right)100\% = 16.7\%
$$
■
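Computing the APER from a confusion matrix takes one line once the counts are in an array. The sketch below, with a hypothetical function name, uses the riding-mower counts from Example 11.6, with rows for actual membership and columns for predicted membership.

```python
import numpy as np


def apparent_error_rate(confusion):
    """APER (11-34): off-diagonal (misclassified) count over the total count."""
    confusion = np.asarray(confusion)
    misclassified = confusion.sum() - np.trace(confusion)
    return misclassified / confusion.sum()


# Rows: actual pi_1, pi_2; columns: predicted pi_1, pi_2 (Example 11.6).
riding_mower = [[10, 2],
                [2, 10]]
aper = apparent_error_rate(riding_mower)
print(f"{100 * aper:.1f}%")
```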


The APER is intuitively appealing and easy to calculate. Unfortunately, it tends
to underestimate the AER, and the problem does not disappear unless the sample
sizes $n_1$ and $n_2$ are very large. Essentially, this optimistic estimate occurs because the
data used to build the classification function are also used to evaluate it.

Error-rate estimates can be constructed that are better than the apparent error 
rate, remain relatively easy to calculate, and do not require distributional assump- 
tions. One procedure is to split the total sample into a training sample and a valida- 
tion sample. The training sample is used to construct the classification function, and 
the validation sample is used to evaluate it. The error rate is determined by the pro- 
portion misclassified in the validation sample. Although this method overcomes the 
bias problem by not using the same data to both build and judge the classification 
function, it suffers from two main defects: 


(i) It requires large samples. 

(ii) The function evaluated is not the function of interest. Ultimately, almost all of 
the data must be used to construct the classification function. If not, valuable in- 
formation may be lost. 

A second approach that seems to work well is called Lachenbruch’s “holdout”
procedure⁷ (see also Lachenbruch and Mickey [24]):

1. Start with the $\pi_1$ group of observations. Omit one observation from this
group, and develop a classification function based on the remaining $n_1 - 1$, $n_2$
observations.

2. Classify the “holdout” observation, using the function constructed in Step 1.

3. Repeat Steps 1 and 2 until all of the $\pi_1$ observations are classified. Let $n_{1M}^{(H)}$ be
the number of holdout (H) observations misclassified in this group.

4. Repeat Steps 1 through 3 for the $\pi_2$ observations. Let $n_{2M}^{(H)}$ be the number of
holdout observations misclassified in this group.


⁷ Lachenbruch’s holdout procedure is sometimes referred to as jackknifing or cross-validation.


Estimates $\hat{P}(2|1)$ and $\hat{P}(1|2)$ of the conditional misclassification probabilities
in (11-1) and (11-2) are then given by

$$
\begin{aligned}
\hat{P}(2|1) &= \frac{n_{1M}^{(H)}}{n_1} \\
\hat{P}(1|2) &= \frac{n_{2M}^{(H)}}{n_2}
\end{aligned}
\tag{11-35}
$$

and the total proportion misclassified, $(n_{1M}^{(H)} + n_{2M}^{(H)})/(n_1 + n_2)$, is, for moderate
samples, a nearly unbiased estimate of the expected actual error rate, $E(\text{AER})$.

$$
\hat{E}(\text{AER}) = \frac{n_{1M}^{(H)} + n_{2M}^{(H)}}{n_1 + n_2}
\tag{11-36}
$$


Lachenbruch’s holdout method is computationally feasible when used in con- 
junction with the linear classification statistics in (11-18) or (11-19). It is offered as 
an option in some readily available discriminant analysis computer programs. 
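Steps 1 through 4 of the holdout procedure can be sketched with NumPy for the equal costs and equal priors version of (11-18). This is a minimal illustration under stated assumptions: the function names are hypothetical; the $\pi_1$ rows are the three observations held out in Example 11.7; the $\pi_2$ rows are assumed values, chosen to agree with the pooled covariance matrix and confusion matrix quoted in that example.

```python
import numpy as np


def linear_rule(X1, X2):
    """Equal costs, equal priors version of (11-18): returns (a_hat, m_hat)."""
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    W1 = (X1 - xbar1).T @ (X1 - xbar1)        # (n1 - 1) S1
    W2 = (X2 - xbar2).T @ (X2 - xbar2)        # (n2 - 1) S2
    S_pooled = (W1 + W2) / (n1 + n2 - 2)
    a_hat = np.linalg.solve(S_pooled, xbar1 - xbar2)
    return a_hat, 0.5 * a_hat @ (xbar1 + xbar2)


def holdout_misclassified(X_own, X_other, own_is_first):
    """Count holdout misclassifications within the group X_own."""
    errors = 0
    for j in range(len(X_own)):
        reduced = np.delete(X_own, j, axis=0)         # Step 1: omit one observation
        if own_is_first:
            a_hat, m_hat = linear_rule(reduced, X_other)
        else:
            a_hat, m_hat = linear_rule(X_other, reduced)
        assigned_first = a_hat @ X_own[j] >= m_hat    # Step 2: classify the holdout
        errors += int(assigned_first != own_is_first)
    return errors


X1 = np.array([[2.0, 12.0], [4.0, 10.0], [3.0, 8.0]])   # pi_1 rows (Example 11.7)
X2 = np.array([[5.0, 7.0], [3.0, 9.0], [4.0, 5.0]])     # pi_2 rows: ASSUMED values

n1M_H = holdout_misclassified(X1, X2, own_is_first=True)    # Step 3
n2M_H = holdout_misclassified(X2, X1, own_is_first=False)   # Step 4
e_aer_hat = (n1M_H + n2M_H) / (len(X1) + len(X2))           # (11-36)
print(n1M_H, n2M_H, e_aer_hat)
```

The sketch refits the rule from scratch for every holdout observation, which is simple but repeats the matrix inversion each time; for larger samples an updating formula for the inverse would avoid that cost.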


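Example 11.7 (Calculating an estimate of the error rate using the holdout procedure)
We shall illustrate Lachenbruch’s holdout procedure and the calculation of error
rate estimates for the equal costs and equal priors version of (11-18). Consider the
following data matrices and descriptive statistics. (We shall assume that the
$n_1 = n_2 = 3$ bivariate observations were selected randomly from two populations
$\pi_1$ and $\pi_2$ with a common covariance matrix.)

$$
\mathbf{X}_1 = \begin{bmatrix} 2 & 12 \\ 4 & 10 \\ 3 & 8 \end{bmatrix}; \quad
\bar{\mathbf{x}}_1 = \begin{bmatrix} 3 \\ 10 \end{bmatrix}, \quad
2\mathbf{S}_1 = \begin{bmatrix} 2 & -2 \\ -2 & 8 \end{bmatrix}
$$

$$
\mathbf{X}_2 = \begin{bmatrix} 5 & 7 \\ 3 & 9 \\ 4 & 5 \end{bmatrix}; \quad
\bar{\mathbf{x}}_2 = \begin{bmatrix} 4 \\ 7 \end{bmatrix}, \quad
2\mathbf{S}_2 = \begin{bmatrix} 2 & -2 \\ -2 & 8 \end{bmatrix}
$$

The pooled covariance matrix is

$$
\mathbf{S}_{\text{pooled}} = \frac{1}{4}(2\mathbf{S}_1 + 2\mathbf{S}_2) = \begin{bmatrix} 1 & -1 \\ -1 & 4 \end{bmatrix}
$$

Using $\mathbf{S}_{\text{pooled}}$, the rest of the data, and Rule (11-18) with equal costs and equal
priors, we may classify the sample observations. You may then verify (see Exercise
11.19) that the confusion matrix is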




                                 Classify as:
                              $\pi_1$      $\pi_2$
True population:   $\pi_1$       2          1
                   $\pi_2$       1          2

and consequently,

APER (apparent error rate) = 2/6 = .33


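Holding out the first observation $\mathbf{x}_H' = [2, 12]$ from $\mathbf{X}_1$, we calculate

$$\mathbf{X}_{1H} = \begin{bmatrix} 4 & 10 \\ 3 & 8 \end{bmatrix}; \quad \bar{\mathbf{x}}_{1H} = \begin{bmatrix} 3.5 \\ 9 \end{bmatrix} \quad \text{and} \quad 1\mathbf{S}_{1H} = \begin{bmatrix} .5 & 1 \\ 1 & 2 \end{bmatrix}$$

The new pooled covariance matrix, $\mathbf{S}_{H,\text{pooled}}$, is

$$\mathbf{S}_{H,\text{pooled}} = \frac{1}{3}\left[1\mathbf{S}_{1H} + 2\mathbf{S}_2\right] = \frac{1}{3}\begin{bmatrix} 2.5 & -1 \\ -1 & 10 \end{bmatrix}$$

with inverse⁸

$$\mathbf{S}_{H,\text{pooled}}^{-1} = \frac{1}{8}\begin{bmatrix} 10 & 1 \\ 1 & 2.5 \end{bmatrix}$$

It is computationally quicker to classify the holdout observation $\mathbf{x}_H$ on the basis
of its squared distances from the group means $\bar{\mathbf{x}}_{1H}$ and $\bar{\mathbf{x}}_2$. This procedure is equivalent
to computing the value of the linear function $\hat{y} = \hat{\mathbf{a}}_H'\mathbf{x}_H = (\bar{\mathbf{x}}_{1H} - \bar{\mathbf{x}}_2)'\mathbf{S}_{H,\text{pooled}}^{-1}\mathbf{x}_H$
and comparing it to the midpoint $\hat{m}_H = \frac{1}{2}(\bar{\mathbf{x}}_{1H} - \bar{\mathbf{x}}_2)'\mathbf{S}_{H,\text{pooled}}^{-1}(\bar{\mathbf{x}}_{1H} + \bar{\mathbf{x}}_2)$. [See
(11-19) and (11-20).]

Thus with $\mathbf{x}_H' = [2, 12]$ we have

$$\text{Squared distance from } \bar{\mathbf{x}}_{1H} = (\mathbf{x}_H - \bar{\mathbf{x}}_{1H})'\mathbf{S}_{H,\text{pooled}}^{-1}(\mathbf{x}_H - \bar{\mathbf{x}}_{1H}) = [2 - 3.5 \quad 12 - 9]\,\frac{1}{8}\begin{bmatrix} 10 & 1 \\ 1 & 2.5 \end{bmatrix}\begin{bmatrix} 2 - 3.5 \\ 12 - 9 \end{bmatrix} = 4.5$$

$$\text{Squared distance from } \bar{\mathbf{x}}_2 = (\mathbf{x}_H - \bar{\mathbf{x}}_2)'\mathbf{S}_{H,\text{pooled}}^{-1}(\mathbf{x}_H - \bar{\mathbf{x}}_2) = [2 - 4 \quad 12 - 7]\,\frac{1}{8}\begin{bmatrix} 10 & 1 \\ 1 & 2.5 \end{bmatrix}\begin{bmatrix} 2 - 4 \\ 12 - 7 \end{bmatrix} = 10.3$$

Since the distance from $\mathbf{x}_H$ to $\bar{\mathbf{x}}_{1H}$ is smaller than the distance from $\mathbf{x}_H$ to $\bar{\mathbf{x}}_2$, we
classify $\mathbf{x}_H$ as a $\pi_1$ observation. In this case, the classification is correct.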
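If $\mathbf{x}_H' = [4, 10]$ is withheld, $\bar{\mathbf{x}}_{1H}$ and $\mathbf{S}_{H,\text{pooled}}^{-1}$ become

$$\bar{\mathbf{x}}_{1H} = \begin{bmatrix} 2.5 \\ 10 \end{bmatrix} \quad \text{and} \quad \mathbf{S}_{H,\text{pooled}}^{-1} = \frac{1}{8}\begin{bmatrix} 16 & 4 \\ 4 & 2.5 \end{bmatrix}$$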


⁸ A matrix identity due to Bartlett [3] allows for the quick calculation of $\mathbf{S}_{H,\text{pooled}}^{-1}$ directly from $\mathbf{S}_{\text{pooled}}^{-1}$.
Thus one does not have to recompute the inverse after withholding each observation. (See Exercise 11.20.)




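We find that

$$(\mathbf{x}_H - \bar{\mathbf{x}}_{1H})'\mathbf{S}_{H,\text{pooled}}^{-1}(\mathbf{x}_H - \bar{\mathbf{x}}_{1H}) = [4 - 2.5 \quad 10 - 10]\,\frac{1}{8}\begin{bmatrix} 16 & 4 \\ 4 & 2.5 \end{bmatrix}\begin{bmatrix} 4 - 2.5 \\ 10 - 10 \end{bmatrix} = 4.5$$

$$(\mathbf{x}_H - \bar{\mathbf{x}}_2)'\mathbf{S}_{H,\text{pooled}}^{-1}(\mathbf{x}_H - \bar{\mathbf{x}}_2) = [4 - 4 \quad 10 - 7]\,\frac{1}{8}\begin{bmatrix} 16 & 4 \\ 4 & 2.5 \end{bmatrix}\begin{bmatrix} 4 - 4 \\ 10 - 7 \end{bmatrix} = 2.8$$

and consequently, we would incorrectly assign $\mathbf{x}_H' = [4, 10]$ to $\pi_2$. Holding out
$\mathbf{x}_H' = [3, 8]$ leads to incorrectly assigning this observation to $\pi_2$ as well. Thus,
$n_{1M}^{(H)} = 2$.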
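Turning to the second group, suppose $\mathbf{x}_H' = [5, 7]$ is withheld. Then

$$\mathbf{X}_{2H} = \begin{bmatrix} 3 & 9 \\ 4 & 5 \end{bmatrix}; \quad \bar{\mathbf{x}}_{2H} = \begin{bmatrix} 3.5 \\ 7 \end{bmatrix} \quad \text{and} \quad 1\mathbf{S}_{2H} = \begin{bmatrix} .5 & -2 \\ -2 & 8 \end{bmatrix}$$

The new pooled covariance matrix is

$$\mathbf{S}_{H,\text{pooled}} = \frac{1}{3}\left[2\mathbf{S}_1 + 1\mathbf{S}_{2H}\right] = \frac{1}{3}\begin{bmatrix} 2.5 & -4 \\ -4 & 16 \end{bmatrix}$$

with inverse

$$\mathbf{S}_{H,\text{pooled}}^{-1} = \frac{3}{24}\begin{bmatrix} 16 & 4 \\ 4 & 2.5 \end{bmatrix}$$

We find that

$$(\mathbf{x}_H - \bar{\mathbf{x}}_1)'\mathbf{S}_{H,\text{pooled}}^{-1}(\mathbf{x}_H - \bar{\mathbf{x}}_1) = [5 - 3 \quad 7 - 10]\,\frac{3}{24}\begin{bmatrix} 16 & 4 \\ 4 & 2.5 \end{bmatrix}\begin{bmatrix} 5 - 3 \\ 7 - 10 \end{bmatrix} = 4.8$$

$$(\mathbf{x}_H - \bar{\mathbf{x}}_{2H})'\mathbf{S}_{H,\text{pooled}}^{-1}(\mathbf{x}_H - \bar{\mathbf{x}}_{2H}) = [5 - 3.5 \quad 7 - 7]\,\frac{3}{24}\begin{bmatrix} 16 & 4 \\ 4 & 2.5 \end{bmatrix}\begin{bmatrix} 5 - 3.5 \\ 7 - 7 \end{bmatrix} = 4.5$$

and $\mathbf{x}_H' = [5, 7]$ is correctly assigned to $\pi_2$.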


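When $\mathbf{x}_H' = [3, 9]$ is withheld,

$$(\mathbf{x}_H - \bar{\mathbf{x}}_1)'\mathbf{S}_{H,\text{pooled}}^{-1}(\mathbf{x}_H - \bar{\mathbf{x}}_1) = [3 - 3 \quad 9 - 10]\,\frac{3}{24}\begin{bmatrix} 10 & 1 \\ 1 & 2.5 \end{bmatrix}\begin{bmatrix} 3 - 3 \\ 9 - 10 \end{bmatrix} = .3$$

$$(\mathbf{x}_H - \bar{\mathbf{x}}_{2H})'\mathbf{S}_{H,\text{pooled}}^{-1}(\mathbf{x}_H - \bar{\mathbf{x}}_{2H}) = [3 - 4.5 \quad 9 - 6]\,\frac{3}{24}\begin{bmatrix} 10 & 1 \\ 1 & 2.5 \end{bmatrix}\begin{bmatrix} 3 - 4.5 \\ 9 - 6 \end{bmatrix} = 4.5$$

and $\mathbf{x}_H' = [3, 9]$ is incorrectly assigned to $\pi_1$. Finally, withholding $\mathbf{x}_H' = [4, 5]$ leads
to correctly classifying this observation as $\pi_2$. Thus, $n_{2M}^{(H)} = 1$.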




An estimate of the expected actual error rate is provided by

$$\hat{E}(\text{AER}) = \frac{n_{1M}^{(H)} + n_{2M}^{(H)}}{n_1 + n_2} = \frac{2 + 1}{3 + 3} = .5$$

Hence, we see that the apparent error rate APER = .33 is an optimistic measure of
performance. Of course, in practice, sample sizes are larger than those we have
considered here, and the difference between APER and E(AER) may not be as
large. ■
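The holdout computation in Example 11.7 can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the text's own procedure: it recomputes the pooled covariance matrix from scratch each time an observation is withheld (rather than using Bartlett's identity), it handles only the bivariate, two-group, equal-costs, equal-priors case, and the helper names (`scatter`, `pooled`, `sq_dist`, `assign`) are hypothetical. The data are the six observations named in the example.

```python
# Lachenbruch's holdout procedure for two bivariate groups,
# equal costs and equal priors (minimum squared distance rule).
# Helper names are illustrative and do not come from the text.

def mean(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows):
    """Sums of squares and cross products about the mean, i.e. (n - 1) * S."""
    m = mean(rows)
    a = sum((r[0] - m[0]) ** 2 for r in rows)
    b = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    d = sum((r[1] - m[1]) ** 2 for r in rows)
    return [[a, b], [b, d]]

def pooled(g1, g2):
    """Pooled covariance matrix of two groups."""
    s1, s2 = scatter(g1), scatter(g2)
    dof = len(g1) + len(g2) - 2
    return [[(s1[i][j] + s2[i][j]) / dof for j in range(2)] for i in range(2)]

def sq_dist(x, m, s):
    """(x - m)' S^{-1} (x - m) for a symmetric 2 x 2 matrix S."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    u, v = x[0] - m[0], x[1] - m[1]
    return (s[1][1] * u * u - 2 * s[0][1] * u * v + s[0][0] * v * v) / det

def assign(x, g1, g2):
    """Return 1 or 2: the group whose mean is closer to x."""
    s = pooled(g1, g2)
    return 1 if sq_dist(x, mean(g1), s) <= sq_dist(x, mean(g2), s) else 2

def apparent_errors(g1, g2):
    """Misclassification counts when every observation helps build the rule."""
    return (sum(assign(x, g1, g2) != 1 for x in g1),
            sum(assign(x, g1, g2) != 2 for x in g2))

def holdout_errors(g1, g2):
    """Misclassification counts when each observation is withheld in turn."""
    e1 = sum(assign(x, g1[:i] + g1[i + 1:], g2) != 1 for i, x in enumerate(g1))
    e2 = sum(assign(x, g1, g2[:i] + g2[i + 1:]) != 2 for i, x in enumerate(g2))
    return e1, e2

X1 = [[2, 12], [4, 10], [3, 8]]
X2 = [[5, 7], [3, 9], [4, 5]]
n = len(X1) + len(X2)

aper = sum(apparent_errors(X1, X2)) / n   # apparent error rate
e_aer = sum(holdout_errors(X1, X2)) / n   # holdout estimate of E(AER)
print(round(aper, 2), e_aer)
```

Refitting the rule for every withheld observation is affordable here because each fit is a 2 x 2 computation; for larger problems the updating identity mentioned in the footnote avoids repeated inversion.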


If you are interested in pursuing the approaches to estimating classification 
error rates, see [23]. 

The next example illustrates a difficulty that can arise when the variance of the 
discriminant is not the same for both populations. 


Example 11.8 (Classifying Alaskan and Canadian salmon) The salmon fishery is a 
valuable resource for both the United States and Canada. Because it is a limited 
resource, it must be managed efficiently. Moreover, since more than one country is 
involved, problems must be solved equitably. That is, Alaskan commercial fishermen 
cannot catch too many Canadian salmon and vice versa. 

These fish have a remarkable life cycle. They are born in freshwater streams 
and after a year or two swim into the ocean. After a couple of years in salt water, 
they return to their place of birth to spawn and die. At the time they are about to 
return as mature fish, they are harvested while still in the ocean. To help regulate 
catches, samples of fish taken during the harvest must be identified as coming 
from Alaskan or Canadian waters. The fish carry some information about their 
birthplace in the growth rings on their scales. Typically, the rings associated with 
freshwater growth are smaller for the Alaskan-born than for the Canadian-born 
salmon. Table 11.2 gives the diameters of the growth ring regions, magnified 100 
times, where 


$X_1$ = diameter of rings for the first-year freshwater growth
(hundredths of an inch)


$X_2$ = diameter of rings for the first-year marine growth
(hundredths of an inch)


In addition, females are coded as 1 and males are coded as 2. 
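Training samples of sizes $n_1 = 50$ Alaskan-born and $n_2 = 50$ Canadian-born
salmon yield the summary statistics

$$\bar{\mathbf{x}}_1 = \begin{bmatrix} 98.380 \\ 429.660 \end{bmatrix}, \quad \mathbf{S}_1 = \begin{bmatrix} 260.608 & -188.093 \\ -188.093 & 1399.086 \end{bmatrix}$$

$$\bar{\mathbf{x}}_2 = \begin{bmatrix} 137.460 \\ 366.620 \end{bmatrix}, \quad \mathbf{S}_2 = \begin{bmatrix} 326.090 & 133.505 \\ 133.505 & 893.261 \end{bmatrix}$$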




Table 11.2 Salmon Data (Growth-Ring Diameters) 


Alaskan                               Canadian
Gender   Freshwater   Marine   Gender   Freshwater   Marine

[The first 45 rows of each group are illegible in this copy; the final five rows of each group appear in the continuation of the table.]




Table 11.2 (continued) 


Alaskan Canadian 
Gender Freshwater Marine Gender Freshwater Marine 
1 95 433 2 150 339 
2 92 404 2 124 341 
1 99 481 1 125 346 
2 94 491 1 153 352 
1 87 480 1 108 339 


Gender Key: 1 = female; 2 = male.
Source: Data courtesy of K. A. Jensen and B. Van Alen of the State of Alaska Department of Fish and Game. 


The data appear to satisfy the assumption of bivariate normal distributions (see 
Exercise 11.31), but the covariance matrices may differ. However, to illustrate a point 
concerning misclassification probabilities, we will use the linear classification procedure. 

The classification procedure, using equal costs and equal prior probabilities,
yields the holdout estimated error rates

                                        Predicted membership
                                  $\pi_1$: Alaskan    $\pi_2$: Canadian
Actual        $\pi_1$: Alaskan           44                   6
membership    $\pi_2$: Canadian           1                  49

based on the linear classification function [see (11-19) and (11-20)]

$$\hat{w} = \hat{y} - \hat{m} = -5.54121 - .12839x_1 + .05194x_2$$


There is some difference in the sample standard deviations of w for the two 
populations: 


                     Sample     Sample
              n      Mean       Standard Deviation
Alaskan      50      4.144      3.253
Canadian     50     -4.147      2.450


Although the overall error rate (7/100, or 7%) is quite low, there is an unfairness
here. A Canadian-born salmon is less likely to be misclassified as Alaskan born
than an Alaskan-born salmon is to be misclassified as Canadian born. Figure 11.8,
which shows the two normal densities for the linear discriminant $\hat{y}$, explains this
phenomenon. Use of the midpoint between the two sample means does not make the two
misclassification probabilities equal. It clearly penalizes the population with the
largest variance. Thus, blind adherence to the linear classification procedure can be
unwise. ■

Figure 11.8 Schematic of normal densities for linear discriminant $\hat{y}$: salmon data.
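The asymmetry can be made concrete with a rough calculation. The sketch below assumes, purely for illustration, that the discriminant $\hat{w}$ is exactly normal in each population with the sample means and standard deviations reported above, and that the cutoff is $\hat{w} = 0$ (the midpoint, since $\hat{w} = \hat{y} - \hat{m}$). Under those assumptions it evaluates the two tail areas; the resulting figures are approximations, not the holdout estimates.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Sample means and standard deviations of w reported for the salmon data.
alaskan = {"mean": 4.144, "sd": 3.253}
canadian = {"mean": -4.147, "sd": 2.450}

cutoff = 0.0  # w >= 0 is classified as Alaskan, w < 0 as Canadian

# Alaskan fish falling below the cutoff are called Canadian.
p_alaskan_as_canadian = normal_cdf((cutoff - alaskan["mean"]) / alaskan["sd"])
# Canadian fish falling above the cutoff are called Alaskan.
p_canadian_as_alaskan = 1.0 - normal_cdf((cutoff - canadian["mean"]) / canadian["sd"])

print(round(p_alaskan_as_canadian, 3), round(p_canadian_as_alaskan, 3))
```

Under the normal approximation, the Alaskan group, which has the larger standard deviation, has roughly twice the misclassification probability of the Canadian group, in line with the 6-to-1 split of holdout errors.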


It should be intuitively clear that good classification (low error rates) will depend
upon the separation of the populations. The farther apart the groups, the more
likely it is that a useful classification rule can be developed. This separative goal,
alluded to in Section 11.1, is explored further in Section 11.6.

As we shall see, allocation rules appropriate for the case involving equal prior 
probabilities and equal misclassification costs correspond to functions designed to 
maximally separate populations. It is in this situation that we begin to lose the dis- 
tinction between classification and separation. 


11.5 Classification with Several Populations 


In theory, the generalization of classification procedures from 2 to $g \ge 2$ groups is
straightforward. However, not much is known about the properties of the corre- 
sponding sample classification functions, and in particular, their error rates have not 
been fully investigated. 

The “robustness” of the two group linear classification statistics to, for instance,
unequal covariances or nonnormal distributions can be studied with computer generated
sampling experiments.⁹ For more than two populations, this approach does
not lead to general conclusions, because the properties depend on where the populations
are located, and there are far too many configurations to study conveniently.

As before, our approach in this section will be to develop the theoretically optimal
rules and then indicate the modifications required for real-world applications.

The Minimum Expected Cost of Misclassification Method 


Let $f_i(\mathbf{x})$ be the density associated with population $\pi_i$, $i = 1, 2, \ldots, g$. [For the most
part, we shall take $f_i(\mathbf{x})$ to be a multivariate normal density, but this is unnecessary
for the development of the general theory.] Let

$p_i$ = the prior probability of population $\pi_i$, $i = 1, 2, \ldots, g$
$c(k|i)$ = the cost of allocating an item to $\pi_k$ when, in fact, it belongs
to $\pi_i$, for $k, i = 1, 2, \ldots, g$

For $k = i$, $c(i|i) = 0$. Finally, let $R_k$ be the set of $\mathbf{x}$'s classified as $\pi_k$ and

$$P(k|i) = P(\text{classifying item as } \pi_k \mid \pi_i) = \int_{R_k} f_i(\mathbf{x})\,d\mathbf{x}$$

for $k, i = 1, 2, \ldots, g$ with $P(i|i) = 1 - \sum_{k=1,\,k \ne i}^{g} P(k|i)$.


⁹ Here robustness refers to the deterioration in error rates caused by using a classification procedure
with data that do not conform to the assumptions on which the procedure was based.

It is very difficult to study the robustness of classification procedures analytically. However, data 
from a wide variety of distributions with different covariance structures can be easily generated 
on a computer. The performance of various classification rules can then be evaluated using computer- 
generated “samples” from these distributions. 




The conditional expected cost of misclassifying an $\mathbf{x}$ from $\pi_1$ into $\pi_2$, or $\pi_3, \ldots,$
or $\pi_g$ is

$$\text{ECM}(1) = P(2|1)c(2|1) + P(3|1)c(3|1) + \cdots + P(g|1)c(g|1) = \sum_{k=2}^{g} P(k|1)c(k|1)$$

This conditional expected cost occurs with prior probability $p_1$, the probability of $\pi_1$.

In a similar manner, we can obtain the conditional expected costs of misclassification
ECM(2), ..., ECM(g). Multiplying each conditional ECM by its prior probability
and summing gives the overall ECM:

$$\text{ECM} = p_1\text{ECM}(1) + p_2\text{ECM}(2) + \cdots + p_g\text{ECM}(g)$$

$$= p_1\left(\sum_{k=2}^{g} P(k|1)c(k|1)\right) + p_2\left(\sum_{k=1,\,k \ne 2}^{g} P(k|2)c(k|2)\right) + \cdots + p_g\left(\sum_{k=1}^{g-1} P(k|g)c(k|g)\right)$$

$$= \sum_{i=1}^{g} p_i\left(\sum_{k=1,\,k \ne i}^{g} P(k|i)c(k|i)\right) \tag{11-37}$$

Determining an optimal classification procedure amounts to choosing the mutually
exclusive and exhaustive classification regions $R_1, R_2, \ldots, R_g$ such that
(11-37) is a minimum.


Result 11.5. The classification regions that minimize the ECM (11-37) are defined
by allocating $\mathbf{x}$ to that population $\pi_k$, $k = 1, 2, \ldots, g$, for which

$$\sum_{i=1,\,i \ne k}^{g} p_i f_i(\mathbf{x})\,c(k|i) \tag{11-38}$$

is smallest. If a tie occurs, $\mathbf{x}$ can be assigned to any of the tied populations.
Proof. See Anderson [2]. ■


Suppose all the misclassification costs are equal, in which case the minimum expected
cost of misclassification rule is the minimum total probability of misclassification
rule. (Without loss of generality, we can set all the misclassification costs equal to 1.)
Using the argument leading to (11-38), we would allocate $\mathbf{x}$ to that population
$\pi_k$, $k = 1, 2, \ldots, g$, for which

$$\sum_{i=1,\,i \ne k}^{g} p_i f_i(\mathbf{x}) \tag{11-39}$$

is smallest. Now, (11-39) will be smallest when the omitted term, $p_k f_k(\mathbf{x})$, is largest.
Consequently, when the misclassification costs are the same, the minimum expected
cost of misclassification rule has the following rather simple form.


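Minimum ECM Classification Rule
with Equal Misclassification Costs

Allocate $\mathbf{x}_0$ to $\pi_k$ if

$$p_k f_k(\mathbf{x}) > p_i f_i(\mathbf{x}) \quad \text{for all } i \ne k \tag{11-40}$$

or, equivalently,

Allocate $\mathbf{x}_0$ to $\pi_k$ if

$$\ln p_k f_k(\mathbf{x}) > \ln p_i f_i(\mathbf{x}) \quad \text{for all } i \ne k \tag{11-41}$$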


It is interesting to note that the classification rule in (11-40) is identical to the
one that maximizes the “posterior” probability $P(\pi_k|\mathbf{x})$ = P($\mathbf{x}$ comes from $\pi_k$
given that $\mathbf{x}$ was observed), where

$$P(\pi_k|\mathbf{x}) = \frac{p_k f_k(\mathbf{x})}{\sum_{i=1}^{g} p_i f_i(\mathbf{x})} = \frac{(\text{prior}) \times (\text{likelihood})}{\sum [(\text{prior}) \times (\text{likelihood})]} \quad \text{for } k = 1, 2, \ldots, g \tag{11-42}$$

Equation (11-42) is the generalization of Equation (11-9) to $g \ge 2$ groups.

You should keep in mind that, in general, the minimum ECM rules have three 
components: prior probabilities, misclassification costs, and density functions. These 
components must be specified (or estimated) before the rules can be implemented. 


Example 11.9 (Classifying a new observation into one of three known populations)
Let us assign an observation $\mathbf{x}_0$ to one of the $g = 3$ populations $\pi_1$, $\pi_2$, or $\pi_3$, given
the following hypothetical prior probabilities, misclassification costs, and density
values:

                                          True population
                             $\pi_1$              $\pi_2$              $\pi_3$
                $\pi_1$    c(1|1) = 0         c(1|2) = 500       c(1|3) = 100
Classify as:    $\pi_2$    c(2|1) = 10        c(2|2) = 0         c(2|3) = 50
                $\pi_3$    c(3|1) = 50        c(3|2) = 200       c(3|3) = 0

Prior probabilities:       $p_1$ = .05          $p_2$ = .60          $p_3$ = .35
Densities at $\mathbf{x}_0$:         $f_1(\mathbf{x}_0)$ = .01     $f_2(\mathbf{x}_0)$ = .85     $f_3(\mathbf{x}_0)$ = 2

We shall use the minimum ECM procedures.




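The values of $\sum_{i=1,\,i \ne k}^{3} p_i f_i(\mathbf{x}_0)\,c(k|i)$ [see (11-38)] are

$k = 1$: $p_2 f_2(\mathbf{x}_0)c(1|2) + p_3 f_3(\mathbf{x}_0)c(1|3)$
      = (.60)(.85)(500) + (.35)(2)(100) = 325
$k = 2$: $p_1 f_1(\mathbf{x}_0)c(2|1) + p_3 f_3(\mathbf{x}_0)c(2|3)$
      = (.05)(.01)(10) + (.35)(2)(50) = 35.005
$k = 3$: $p_1 f_1(\mathbf{x}_0)c(3|1) + p_2 f_2(\mathbf{x}_0)c(3|2)$
      = (.05)(.01)(50) + (.60)(.85)(200) = 102.025

Since $\sum_{i=1,\,i \ne k}^{3} p_i f_i(\mathbf{x}_0)\,c(k|i)$ is smallest for $k = 2$, we would allocate $\mathbf{x}_0$ to $\pi_2$.

If all costs of misclassification were equal, we would assign $\mathbf{x}_0$ according to
(11-40), which requires only the products

$p_1 f_1(\mathbf{x}_0)$ = (.05)(.01) = .0005
$p_2 f_2(\mathbf{x}_0)$ = (.60)(.85) = .510
$p_3 f_3(\mathbf{x}_0)$ = (.35)(2) = .700

Since

$$p_3 f_3(\mathbf{x}_0) = .700 \ge p_i f_i(\mathbf{x}_0), \quad i = 1, 2$$

we should allocate $\mathbf{x}_0$ to $\pi_3$. Equivalently, calculating the posterior probabilities [see
(11-42)], we obtain

$$P(\pi_1|\mathbf{x}_0) = \frac{p_1 f_1(\mathbf{x}_0)}{\sum_{i=1}^{3} p_i f_i(\mathbf{x}_0)} = \frac{(.05)(.01)}{(.05)(.01) + (.60)(.85) + (.35)(2)} = \frac{.0005}{1.2105} = .0004$$

$$P(\pi_2|\mathbf{x}_0) = \frac{p_2 f_2(\mathbf{x}_0)}{\sum_{i=1}^{3} p_i f_i(\mathbf{x}_0)} = \frac{(.60)(.85)}{1.2105} = \frac{.510}{1.2105} = .421$$

$$P(\pi_3|\mathbf{x}_0) = \frac{p_3 f_3(\mathbf{x}_0)}{\sum_{i=1}^{3} p_i f_i(\mathbf{x}_0)} = \frac{(.35)(2)}{1.2105} = \frac{.700}{1.2105} = .578$$

We see that $\mathbf{x}_0$ is allocated to $\pi_3$, the population with the largest posterior probability. ■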
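The arithmetic of Example 11.9 is easy to reproduce. The following sketch evaluates the sums in (11-38) and the posterior probabilities in (11-42) for the priors, costs, and density values given in the example. The function names and the `cost[k][i]` layout are illustrative choices, not notation from the text. Evaluated exactly, the $k = 2$ sum is 35.005.

```python
# Minimum ECM allocation (11-38) and posterior probabilities (11-42)
# for the hypothetical inputs of Example 11.9.

priors = [0.05, 0.60, 0.35]   # p_1, p_2, p_3
dens = [0.01, 0.85, 2.0]      # f_1(x0), f_2(x0), f_3(x0)

# cost[k][i] = c(k+1 | i+1): cost of classifying into population k+1
# an item that actually belongs to population i+1.
cost = [[0, 500, 100],
        [10, 0, 50],
        [50, 200, 0]]

def ecm_scores(priors, dens, cost):
    """Sum over i != k of p_i f_i(x0) c(k|i), for each candidate k."""
    g = len(priors)
    return [sum(priors[i] * dens[i] * cost[k][i] for i in range(g) if i != k)
            for k in range(g)]

def posteriors(priors, dens):
    """P(pi_k | x0) = p_k f_k(x0) / sum_i p_i f_i(x0)."""
    joint = [p * f for p, f in zip(priors, dens)]
    total = sum(joint)
    return [j / total for j in joint]

scores = ecm_scores(priors, dens, cost)
min_ecm_choice = scores.index(min(scores)) + 1       # smallest expected cost

post = posteriors(priors, dens)
equal_cost_choice = post.index(max(post)) + 1        # largest posterior

print([round(s, 3) for s in scores], min_ecm_choice)
print([round(p, 4) for p in post], equal_cost_choice)
```

The two rules disagree here: the large cost c(1|2) = 500 and the moderate cost c(3|2) = 200 make $\pi_2$ the cheapest allocation even though $\pi_3$ has the largest posterior probability.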


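Classification with Normal Populations

An important special case occurs when the

$$f_i(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i)\right], \quad i = 1, 2, \ldots, g \tag{11-43}$$

are multivariate normal densities with mean vectors $\boldsymbol{\mu}_i$ and covariance matrices $\boldsymbol{\Sigma}_i$.
If, further, $c(i|i) = 0$, $c(k|i) = 1$, $k \ne i$ (or, equivalently, the misclassification costs
are all equal), then (11-41) becomes

Allocate $\mathbf{x}$ to $\pi_k$ if

$$\ln p_k f_k(\mathbf{x}) = \ln p_k - \left(\frac{p}{2}\right)\ln(2\pi) - \frac{1}{2}\ln|\boldsymbol{\Sigma}_k| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)'\boldsymbol{\Sigma}_k^{-1}(\mathbf{x} - \boldsymbol{\mu}_k) = \max_i \ln p_i f_i(\mathbf{x}) \tag{11-44}$$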


The constant $(p/2)\ln(2\pi)$ can be ignored in (11-44), since it is the same for all
populations. We therefore define the quadratic discrimination score for the ith
population to be

$$d_i^Q(\mathbf{x}) = -\frac{1}{2}\ln|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) + \ln p_i, \quad i = 1, 2, \ldots, g \tag{11-45}$$

The quadratic score $d_i^Q(\mathbf{x})$ is composed of contributions from the generalized
variance $|\boldsymbol{\Sigma}_i|$, the prior probability $p_i$, and the square of the distance from $\mathbf{x}$ to the
population mean $\boldsymbol{\mu}_i$. Note, however, that a different distance function, with a
different orientation and size of the constant-distance ellipsoid, must be used for
each population.

Using discriminant scores, we find that the classification rule (11-44) becomes
the following:


Minimum Total Probability of Misclassification (TPM) Rule
for Normal Populations: Unequal $\boldsymbol{\Sigma}_i$

Allocate $\mathbf{x}$ to $\pi_k$ if

the quadratic score $d_k^Q(\mathbf{x})$ = largest of $d_1^Q(\mathbf{x}), d_2^Q(\mathbf{x}), \ldots, d_g^Q(\mathbf{x})$ (11-46)

where $d_i^Q(\mathbf{x})$ is given by (11-45).


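In practice, the $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ are unknown, but a training set of correctly classified
observations is often available for the construction of estimates. The relevant sample
quantities for population $\pi_i$ are

$\bar{\mathbf{x}}_i$ = sample mean vector
$\mathbf{S}_i$ = sample covariance matrix

and

$n_i$ = sample size

The estimate of the quadratic discrimination score $\hat{d}_i^Q(\mathbf{x})$ is then

$$\hat{d}_i^Q(\mathbf{x}) = -\frac{1}{2}\ln|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x} - \bar{\mathbf{x}}_i)'\mathbf{S}_i^{-1}(\mathbf{x} - \bar{\mathbf{x}}_i) + \ln p_i, \quad i = 1, 2, \ldots, g \tag{11-47}$$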


and the classification rule based on the sample is as follows: 




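Estimated Minimum (TPM) Rule
for Several Normal Populations: Unequal $\boldsymbol{\Sigma}_i$

Allocate $\mathbf{x}$ to $\pi_k$ if

the quadratic score $\hat{d}_k^Q(\mathbf{x})$ = largest of $\hat{d}_1^Q(\mathbf{x}), \hat{d}_2^Q(\mathbf{x}), \ldots, \hat{d}_g^Q(\mathbf{x})$ (11-48)

where $\hat{d}_i^Q(\mathbf{x})$ is given by (11-47).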
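As an illustration of (11-47) and (11-48), the sketch below evaluates the estimated quadratic score for a bivariate observation. It is restricted to $p = 2$ so that the determinant and inverse can be written out explicitly. The group statistics are the salmon summary statistics from Example 11.8; the equal priors and the observation `x0` are assumptions made only for this illustration, and the function names are hypothetical.

```python
import math

def quadratic_score(x, mean, cov, prior):
    """Estimated quadratic discrimination score (11-47), bivariate case."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    u, v = x[0] - mean[0], x[1] - mean[1]
    # (x - mean)' S^{-1} (x - mean) using the explicit 2 x 2 inverse
    dist2 = (cov[1][1] * u * u - 2 * cov[0][1] * u * v + cov[0][0] * v * v) / det
    return -0.5 * math.log(det) - 0.5 * dist2 + math.log(prior)

def allocate(x, groups):
    """Return (label, scores); label is the 1-based group with the largest score."""
    scores = [quadratic_score(x, g["mean"], g["cov"], g["prior"]) for g in groups]
    return scores.index(max(scores)) + 1, scores

# Salmon summary statistics (group 1 = Alaskan, group 2 = Canadian).
# Equal priors are an assumption of this sketch.
groups = [
    {"mean": [98.380, 429.660],
     "cov": [[260.608, -188.093], [-188.093, 1399.086]],
     "prior": 0.5},
    {"mean": [137.460, 366.620],
     "cov": [[326.090, 133.505], [133.505, 893.261]],
     "prior": 0.5},
]

x0 = [120.0, 400.0]  # hypothetical fish, not taken from Table 11.2
label, scores = allocate(x0, groups)
print(label, [round(s, 3) for s in scores])
```

Because each group keeps its own covariance matrix, both the log-determinant term and the distance metric differ between groups, which is what makes the resulting boundary quadratic rather than linear.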


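A simplification is possible if the population covariance matrices, $\boldsymbol{\Sigma}_i$, are equal.
When $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$, for $i = 1, 2, \ldots, g$, the discriminant score in (11-45) becomes

$$d_i^Q(\mathbf{x}) = -\frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_i'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_i'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln p_i$$

The first two terms are the same for $d_1^Q(\mathbf{x}), d_2^Q(\mathbf{x}), \ldots, d_g^Q(\mathbf{x})$, and, consequently,
they can be ignored for allocative purposes. The remaining terms consist of a constant
$c_i = \ln p_i - \frac{1}{2}\boldsymbol{\mu}_i'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i$ and a linear combination of the components of $\mathbf{x}$.

Next, define the linear discriminant score

$$d_i(\mathbf{x}) = \boldsymbol{\mu}_i'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_i'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln p_i \quad \text{for } i = 1, 2, \ldots, g \tag{11-49}$$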


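An estimate $\hat{d}_i(\mathbf{x})$ of the linear discriminant score $d_i(\mathbf{x})$ is based on the pooled
estimate of $\boldsymbol{\Sigma}$,

$$\mathbf{S}_{\text{pooled}} = \frac{1}{n_1 + n_2 + \cdots + n_g - g}\left((n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2 + \cdots + (n_g - 1)\mathbf{S}_g\right) \tag{11-50}$$

and is given by

$$\hat{d}_i(\mathbf{x}) = \bar{\mathbf{x}}_i'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x} - \frac{1}{2}\bar{\mathbf{x}}_i'\mathbf{S}_{\text{pooled}}^{-1}\bar{\mathbf{x}}_i + \ln p_i \quad \text{for } i = 1, 2, \ldots, g \tag{11-51}$$

Consequently, we have the following:

Estimated Minimum TPM Rule
for Equal-Covariance Normal Populations

Allocate $\mathbf{x}$ to $\pi_k$ if

the linear discriminant score $\hat{d}_k(\mathbf{x})$ = the largest of $\hat{d}_1(\mathbf{x}), \hat{d}_2(\mathbf{x}), \ldots, \hat{d}_g(\mathbf{x})$ (11-52)

with $\hat{d}_i(\mathbf{x})$ given by (11-51).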


Comment. Expression (11-49) is a convenient linear function of $\mathbf{x}$. An equivalent
classifier for the equal-covariance case can be obtained from (11-45) by ignoring the
constant term, $-\frac{1}{2}\ln|\boldsymbol{\Sigma}|$. The result, with sample estimates inserted for unknown
population quantities, can then be interpreted in terms of the squared distances

$$D_i^2(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}}_i)'\mathbf{S}_{\text{pooled}}^{-1}(\mathbf{x} - \bar{\mathbf{x}}_i) \tag{11-53}$$

from $\mathbf{x}$ to the sample mean vector $\bar{\mathbf{x}}_i$. The allocatory rule is then

Assign $\mathbf{x}$ to the population $\pi_i$ for which $-\frac{1}{2}D_i^2(\mathbf{x}) + \ln p_i$ is largest (11-54)

We see that this rule, or equivalently (11-52), assigns $\mathbf{x}$ to the “closest” population.
(The distance measure is penalized by $\ln p_i$.)

If the prior probabilities are unknown, the usual procedure is to set $p_1 = p_2 = \cdots = p_g = 1/g$.
An observation is then assigned to the closest population.


Example 11.10 (Calculating sample discriminant scores, assuming a common covariance
matrix) Let us calculate the linear discriminant scores based on data from $g = 3$
populations assumed to be bivariate normal with a common covariance matrix.

Random samples from the populations $\pi_1$, $\pi_2$, and $\pi_3$, along with the sample
mean vectors and covariance matrices, are as follows:


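$$\pi_1: \quad \mathbf{X}_1 = \begin{bmatrix} -2 & 5 \\ 0 & 3 \\ -1 & 1 \end{bmatrix}, \quad \text{so } n_1 = 3, \quad \bar{\mathbf{x}}_1 = \begin{bmatrix} -1 \\ 3 \end{bmatrix}, \quad \text{and } \mathbf{S}_1 = \begin{bmatrix} 1 & -1 \\ -1 & 4 \end{bmatrix}$$

$$\pi_2: \quad \mathbf{X}_2 = \begin{bmatrix} 0 & 6 \\ 2 & 4 \\ 1 & 2 \end{bmatrix}, \quad \text{so } n_2 = 3, \quad \bar{\mathbf{x}}_2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix}, \quad \text{and } \mathbf{S}_2 = \begin{bmatrix} 1 & -1 \\ -1 & 4 \end{bmatrix}$$

$$\pi_3: \quad \mathbf{X}_3 = \begin{bmatrix} 1 & -2 \\ 0 & 0 \\ -1 & -4 \end{bmatrix}, \quad \text{so } n_3 = 3, \quad \bar{\mathbf{x}}_3 = \begin{bmatrix} 0 \\ -2 \end{bmatrix}, \quad \text{and } \mathbf{S}_3 = \begin{bmatrix} 1 & 1 \\ 1 & 4 \end{bmatrix}$$

Given that $p_1 = p_2 = .25$ and $p_3 = .50$, let us classify the observation
$\mathbf{x}_0' = [x_{01}, x_{02}] = [-2 \;\; -1]$ according to (11-52). From (11-50),

$$\mathbf{S}_{\text{pooled}} = \frac{3-1}{9-3}\begin{bmatrix} 1 & -1 \\ -1 & 4 \end{bmatrix} + \frac{3-1}{9-3}\begin{bmatrix} 1 & -1 \\ -1 & 4 \end{bmatrix} + \frac{3-1}{9-3}\begin{bmatrix} 1 & 1 \\ 1 & 4 \end{bmatrix} = \frac{2}{6}\begin{bmatrix} 1+1+1 & -1-1+1 \\ -1-1+1 & 4+4+4 \end{bmatrix} = \begin{bmatrix} 1 & -\frac{1}{3} \\ -\frac{1}{3} & 4 \end{bmatrix}$$

so

$$\mathbf{S}_{\text{pooled}}^{-1} = \frac{9}{35}\begin{bmatrix} 4 & \frac{1}{3} \\ \frac{1}{3} & 1 \end{bmatrix} = \frac{1}{35}\begin{bmatrix} 36 & 3 \\ 3 & 9 \end{bmatrix}$$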


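Next,

$$\bar{\mathbf{x}}_1'\mathbf{S}_{\text{pooled}}^{-1} = [-1 \;\; 3]\,\frac{1}{35}\begin{bmatrix} 36 & 3 \\ 3 & 9 \end{bmatrix} = \frac{1}{35}[-27 \;\; 24]$$

and

$$\bar{\mathbf{x}}_1'\mathbf{S}_{\text{pooled}}^{-1}\bar{\mathbf{x}}_1 = \frac{1}{35}[-27 \;\; 24]\begin{bmatrix} -1 \\ 3 \end{bmatrix} = \frac{99}{35}$$

so

$$\hat{d}_1(\mathbf{x}_0) = \ln p_1 + \bar{\mathbf{x}}_1'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x}_0 - \frac{1}{2}\bar{\mathbf{x}}_1'\mathbf{S}_{\text{pooled}}^{-1}\bar{\mathbf{x}}_1 = \ln(.25) + \left(\frac{-27}{35}\right)x_{01} + \left(\frac{24}{35}\right)x_{02} - \frac{1}{2}\left(\frac{99}{35}\right)$$

Notice the linear form of $\hat{d}_1(\mathbf{x}_0)$ = constant + (constant) $x_{01}$ + (constant) $x_{02}$. In a
similar manner,

$$\bar{\mathbf{x}}_2'\mathbf{S}_{\text{pooled}}^{-1} = [1 \;\; 4]\,\frac{1}{35}\begin{bmatrix} 36 & 3 \\ 3 & 9 \end{bmatrix} = \frac{1}{35}[48 \;\; 39]$$

$$\bar{\mathbf{x}}_2'\mathbf{S}_{\text{pooled}}^{-1}\bar{\mathbf{x}}_2 = \frac{1}{35}[48 \;\; 39]\begin{bmatrix} 1 \\ 4 \end{bmatrix} = \frac{204}{35}$$

and

$$\hat{d}_2(\mathbf{x}_0) = \ln(.25) + \left(\frac{48}{35}\right)x_{01} + \left(\frac{39}{35}\right)x_{02} - \frac{1}{2}\left(\frac{204}{35}\right)$$

Finally,

$$\bar{\mathbf{x}}_3'\mathbf{S}_{\text{pooled}}^{-1} = [0 \;\; -2]\,\frac{1}{35}\begin{bmatrix} 36 & 3 \\ 3 & 9 \end{bmatrix} = \frac{1}{35}[-6 \;\; -18]$$

$$\bar{\mathbf{x}}_3'\mathbf{S}_{\text{pooled}}^{-1}\bar{\mathbf{x}}_3 = \frac{1}{35}[-6 \;\; -18]\begin{bmatrix} 0 \\ -2 \end{bmatrix} = \frac{36}{35}$$

and

$$\hat{d}_3(\mathbf{x}_0) = \ln(.50) + \left(\frac{-6}{35}\right)x_{01} + \left(\frac{-18}{35}\right)x_{02} - \frac{1}{2}\left(\frac{36}{35}\right)$$

Substituting the numerical values $x_{01} = -2$ and $x_{02} = -1$ gives

$$\hat{d}_1(\mathbf{x}_0) = -1.386 + \left(\frac{-27}{35}\right)(-2) + \left(\frac{24}{35}\right)(-1) - \frac{99}{70} = -1.943$$

$$\hat{d}_2(\mathbf{x}_0) = -1.386 + \left(\frac{48}{35}\right)(-2) + \left(\frac{39}{35}\right)(-1) - \frac{204}{70} = -8.158$$

$$\hat{d}_3(\mathbf{x}_0) = -.693 + \left(\frac{-6}{35}\right)(-2) + \left(\frac{-18}{35}\right)(-1) - \frac{36}{70} = -.350$$

Since $\hat{d}_3(\mathbf{x}_0) = -.350$ is the largest discriminant score, we allocate $\mathbf{x}_0$ to $\pi_3$. ■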
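The scores in Example 11.10 can be checked with a short computation. The sketch below takes the sample mean vectors, the inverse pooled covariance matrix (1/35)[[36, 3], [3, 9]], and the priors from the example as given, and evaluates the linear discriminant score (11-51) for each population. The function name `linear_score` is an illustrative choice.

```python
import math

# Quantities taken from Example 11.10.
S_INV = [[36 / 35, 3 / 35],
         [3 / 35, 9 / 35]]             # inverse of the pooled covariance matrix
means = [[-1, 3], [1, 4], [0, -2]]     # sample mean vectors
priors = [0.25, 0.25, 0.50]
x0 = [-2, -1]

def linear_score(x, m, s_inv, prior):
    """Estimated linear discriminant score (11-51) for a bivariate x."""
    # Coefficient vector a = S^{-1} m (S^{-1} is symmetric, so a' = m' S^{-1}).
    a = [s_inv[0][0] * m[0] + s_inv[0][1] * m[1],
         s_inv[1][0] * m[0] + s_inv[1][1] * m[1]]
    constant = -0.5 * (a[0] * m[0] + a[1] * m[1]) + math.log(prior)
    return a[0] * x[0] + a[1] * x[1] + constant

scores = [linear_score(x0, m, S_INV, p) for m, p in zip(means, priors)]
choice = scores.index(max(scores)) + 1

print([round(s, 3) for s in scores], choice)
```

Each score is a fixed constant plus a fixed coefficient vector applied to $\mathbf{x}_0$, so classifying further observations only requires re-evaluating the final line of `linear_score`.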




Example 11.11 (Classifying a potential business-school graduate student) The admission
officer of a business school has used an “index” of undergraduate
grade point average (GPA) and graduate management aptitude test (GMAT)
scores to help decide which applicants should be admitted to the school’s graduate
programs. Figure 11.9 shows pairs of $x_1$ = GPA, $x_2$ = GMAT values for
groups of recent applicants who have been categorized as $\pi_1$: admit; $\pi_2$: do not
admit; and $\pi_3$: borderline.¹⁰ The data pictured are listed in Table 11.6. (See
Exercise 11.29.) These data yield (see the SAS statistical software output in
Panel 11.1)

$$n_1 = 31 \qquad n_2 = 28 \qquad n_3 = 26$$

$$\bar{\mathbf{x}}_1 = \begin{bmatrix} 3.40 \\ 561.23 \end{bmatrix} \qquad \bar{\mathbf{x}}_2 = \begin{bmatrix} 2.48 \\ 447.07 \end{bmatrix} \qquad \bar{\mathbf{x}}_3 = \begin{bmatrix} 2.99 \\ 446.23 \end{bmatrix}$$

$$\bar{\mathbf{x}} = \begin{bmatrix} 2.97 \\ 488.45 \end{bmatrix} \qquad \mathbf{S}_{\text{pooled}} = \begin{bmatrix} .0361 & -2.0188 \\ -2.0188 & 3655.9011 \end{bmatrix}$$

Figure 11.9 Scatter plot of ($x_1$ = GPA, $x_2$ = GMAT) for applicants to a graduate
school of business who have been classified as admit, do not admit, or borderline.
[Plot: GMAT (vertical axis, 270 to 720) against GPA (horizontal axis, 2.10 to 3.90).
Plotting symbols: A: Admit ($\pi_1$); B: Do not admit ($\pi_2$); C: Borderline ($\pi_3$).]


¹⁰ In this case, the populations are artificial in the sense that they have been created by the
admissions officer. On the other hand, experience has shown that applicants with high GPA and high 
GMAT scores generally do well in a graduate program; those with low readings on these variables 
generally experience difficulty. 




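Suppose a new applicant has an undergraduate GPA of $x_1 = 3.21$ and a GMAT
score of $x_2 = 497$. Let us classify this applicant using the rule in (11-54) with equal
prior probabilities.

With $\mathbf{x}_0' = [3.21, 497]$, the sample squared distances are

$$D_1^2(\mathbf{x}_0) = (\mathbf{x}_0 - \bar{\mathbf{x}}_1)'\mathbf{S}_{\text{pooled}}^{-1}(\mathbf{x}_0 - \bar{\mathbf{x}}_1) = [3.21 - 3.40, \;\; 497 - 561.23]\begin{bmatrix} 28.6096 & .0158 \\ .0158 & .0003 \end{bmatrix}\begin{bmatrix} 3.21 - 3.40 \\ 497 - 561.23 \end{bmatrix} = 2.58$$

$$D_2^2(\mathbf{x}_0) = (\mathbf{x}_0 - \bar{\mathbf{x}}_2)'\mathbf{S}_{\text{pooled}}^{-1}(\mathbf{x}_0 - \bar{\mathbf{x}}_2) = 17.10$$

$$D_3^2(\mathbf{x}_0) = (\mathbf{x}_0 - \bar{\mathbf{x}}_3)'\mathbf{S}_{\text{pooled}}^{-1}(\mathbf{x}_0 - \bar{\mathbf{x}}_3) = 2.47$$

Since the distance from $\mathbf{x}_0' = [3.21, 497]$ to the group mean $\bar{\mathbf{x}}_3$ is smallest, we assign
this applicant to $\pi_3$, borderline. ■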
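The three squared distances can be recomputed from the rounded summary statistics quoted in the example. The sketch below inverts the 2 x 2 pooled covariance matrix directly and evaluates (11-53) for each group. Because it starts from rounded means and a rounded $\mathbf{S}_{\text{pooled}}$, its results agree with the printed values only to about two decimal places; the dictionary keys and function names are illustrative.

```python
# Squared distances (11-53) for the applicant in Example 11.11,
# computed from the rounded summary statistics given in the text.

S_POOLED = [[0.0361, -2.0188],
            [-2.0188, 3655.9011]]

means = {
    "admit": [3.40, 561.23],
    "do not admit": [2.48, 447.07],
    "borderline": [2.99, 446.23],
}

x0 = [3.21, 497.0]

def inv2(s):
    """Inverse of a 2 x 2 matrix."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return [[s[1][1] / det, -s[0][1] / det],
            [-s[1][0] / det, s[0][0] / det]]

def sq_dist(x, m, s_inv):
    """(x - m)' S^{-1} (x - m)."""
    u, v = x[0] - m[0], x[1] - m[1]
    return (s_inv[0][0] * u * u
            + (s_inv[0][1] + s_inv[1][0]) * u * v
            + s_inv[1][1] * v * v)

s_inv = inv2(S_POOLED)
distances = {name: sq_dist(x0, m, s_inv) for name, m in means.items()}

# With equal priors, rule (11-54) reduces to choosing the smallest distance.
closest = min(distances, key=distances.get)
print({k: round(v, 2) for k, v in distances.items()}, closest)
```

The admit and borderline distances are close to each other, which matches the intuition that this applicant sits near the boundary between those two groups.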


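The linear discriminant scores (11-49) can be compared, two at a time. Using
these quantities, we see that the condition that $d_k(\mathbf{x})$ is the largest linear discriminant
score among $d_1(\mathbf{x}), d_2(\mathbf{x}), \ldots, d_g(\mathbf{x})$ is equivalent to

$$0 \le d_k(\mathbf{x}) - d_i(\mathbf{x}) = (\boldsymbol{\mu}_k - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}(\boldsymbol{\mu}_k - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_k + \boldsymbol{\mu}_i) + \ln\left(\frac{p_k}{p_i}\right)$$

for all $i = 1, 2, \ldots, g$.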


PANEL 11.1 SAS ANALYSIS FOR ADMISSION DATA USING PROC DISCRIM.

PROGRAM COMMANDS

title 'Discriminant Analysis';
data gpa;
  infile 'T11-6.dat';
  input gpa gmat admit $;
proc discrim data = gpa
  method = normal pool = yes manova wcov pcov listerr crosslisterr;
  priors 'admit' = .3333 'notadmit' = .3333 'border' = .3333;
  class admit; var gpa gmat;
run;


OUTPUT

DISCRIMINANT ANALYSIS

85 Observations     84 DF Total
 2 Variables        82 DF Within Classes
 3 Classes           2 DF Between Classes

Class Level Information

ADMIT         Weight     Proportion
admit        31.0000       0.364706
border       26.0000       0.305882
notadmit     28.0000       0.329412




DISCRIMINANT ANALYSIS     WITHIN-CLASS COVARIANCE MATRICES

ADMIT = admit     DF = 30
Variable          GPA             GMAT
GPA          0.043558         0.058097
GMAT         0.058097      4618.247312

ADMIT = border     DF = 25
Variable          GPA             GMAT

ADMIT = notadmit     DF = 27
Variable          GPA             GMAT
GPA          0.033649        -1.192037
GMAT        -1.192037      3891.253968

Pooled Within-Class Covariance Matrix     DF = 82
Variable          GPA             GMAT
GPA          0.036068        -2.018759
GMAT        -2.018759      3655.901121


Multivariate Statistics and F Approximations 
S = 2     M = -0.5     N = 39.5
Statistic Value F Num DF Den DF Pr>F 
Wilks’ Lambda 0.12637661 73.4257 4 162 0.0001 
Pillai’s Trace 1.00963002 41.7973 4 164 0.0001 
Hotelling-Lawley Trace 5.83665601 116.7331 4 160 0.0001 
Roy’s Greatest Root 5.64604452 231.4878 2 82 0.0001 
NOTE: F Statistic for Roy’s Greatest Root is an upper bound. 
NOTE: F Statistic for Wilks’ Lambda is exact. 


DISCRIMINANT ANALYSIS     LINEAR DISCRIMINANT FUNCTION

Constant = $-.5\,\bar{X}_j'\,\text{COV}^{-1}\bar{X}_j + \ln \text{PRIOR}_j$     Coefficient Vector = $\text{COV}^{-1}\bar{X}_j$

                              ADMIT
                 admit          border        notadmit
CONSTANT    -241.47030      -178.41437      -134.99753
GPA          106.24991        92.66953        78.08637
GMAT           0.21218         0.17323         0.16541

Generalized Squared Distance Function:
$D_j^2(X) = (X - \bar{X}_j)'\,\text{COV}^{-1}(X - \bar{X}_j)$

Posterior Probability of Membership in each ADMIT:
$\Pr(j|X) = \exp(-.5\,D_j^2(X)) \,/\, \text{SUM}_k \exp(-.5\,D_k^2(X))$

Posterior Probability of Membership in ADMIT:

        From         Classified
Obs     ADMIT        into ADMIT       admit      border    notadmit
  2     admit        border *        0.1202      0.8778      0.0020
  3     admit        border *        0.3654      0.6342      0.0004
 24     admit        border *        0.4766      0.5234      0.0000
 31     admit        border *        0.2964      0.7032      0.0004
 58     notadmit     border *        0.0001      0.7550      0.2450
 59     notadmit     border *        0.0001      0.8673      0.1326
 66     border       admit *         0.5336      0.4664      0.0000

* Misclassified observation

Classification Summary for Calibration Data: WORK.GPA
Cross validation Summary using Linear Discriminant Function

Generalized Squared Distance Function:
$D_j^2(X) = (X - \bar{X}_{(X)j})'\,\text{COV}_{(X)}^{-1}(X - \bar{X}_{(X)j})$

Posterior Probability of Membership in each ADMIT:
$\Pr(j|X) = \exp(-.5\,D_j^2(X)) \,/\, \text{SUM}_k \exp(-.5\,D_k^2(X))$

Number of Observations and Percent Classified into ADMIT:

From ADMIT       admit      border    notadmit       Total
admit               26           5           0          31
                 83.87       16.13        0.00      100.00
border               1          24           1          26
                  3.85       92.31        3.85      100.00
notadmit             0           2          26          28
                  0.00        7.14       92.86      100.00
Total               27          31          27          85
Percent          31.76       36.47       31.76      100.00
Priors          0.3333      0.3333      0.3333

Error Count Estimates for ADMIT:

                 admit      border    notadmit       Total
Rate            0.1613      0.0769      0.0714      0.1032
Priors          0.3333      0.3333      0.3333


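Adding $-\ln(p_k/p_i) = \ln(p_i/p_k)$ to both sides of the preceding inequality gives
the alternative form of the classification rule that minimizes the total probability of
misclassification. Thus, we

Allocate $\mathbf{x}$ to $\pi_k$ if

$$(\boldsymbol{\mu}_k - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}(\boldsymbol{\mu}_k - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_k + \boldsymbol{\mu}_i) \ge \ln\left(\frac{p_i}{p_k}\right) \tag{11-55}$$

for all $i = 1, 2, \ldots, g$.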




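Now, denote the left-hand side of (11-55) by $d_{ki}(\mathbf{x})$. Then the conditions in
(11-55) define classification regions $R_1, R_2, \ldots, R_g$, which are separated by (hyper)
planes. This follows because $d_{ki}(\mathbf{x})$ is a linear combination of the components of $\mathbf{x}$.
For example, when $g = 3$, the classification region $R_1$ consists of all $\mathbf{x}$ satisfying

$$R_1: \; d_{1i}(\mathbf{x}) \ge \ln\left(\frac{p_i}{p_1}\right) \quad \text{for } i = 2, 3$$

That is, $R_1$ consists of those $\mathbf{x}$ for which

$$d_{12}(\mathbf{x}) = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) \ge \ln\left(\frac{p_2}{p_1}\right)$$

and, simultaneously,

$$d_{13}(\mathbf{x}) = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_3)'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_3)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_3) \ge \ln\left(\frac{p_3}{p_1}\right)$$

Assuming that $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\mu}_3$ do not lie along a straight line, the equations $d_{12}(\mathbf{x}) = \ln(p_2/p_1)$
and $d_{13}(\mathbf{x}) = \ln(p_3/p_1)$ define two intersecting hyperplanes that delineate
$R_1$ in the p-dimensional variable space. The term $\ln(p_2/p_1)$ places the plane
closer to $\boldsymbol{\mu}_1$ than $\boldsymbol{\mu}_2$ if $p_2$ is greater than $p_1$. The regions $R_1$, $R_2$, and $R_3$ are shown in
Figure 11.10 for the case of two variables. The picture is the same for more variables
if we graph the plane that contains the three mean vectors.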

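The sample version of the alternative form in (11-55) is obtained by substituting
$\bar{\mathbf{x}}_i$ for $\boldsymbol{\mu}_i$ and inserting the pooled sample covariance matrix $\mathbf{S}_{\text{pooled}}$ for $\boldsymbol{\Sigma}$. When
$\sum_{i=1}^{g}(n_i - 1) \ge p$, so that $\mathbf{S}_{\text{pooled}}^{-1}$ exists, this sample analog becomes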


Figure 11.10 The classification regions $R_1$, $R_2$, and $R_3$ for the linear minimum TPM rule
($p_1 = \frac{1}{2}$, $p_2 = \frac{1}{4}$, $p_3 = \frac{1}{4}$).


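Allocate $\mathbf{x}$ to $\pi_k$ if

$$\hat{d}_{ki}(\mathbf{x}) = (\bar{\mathbf{x}}_k - \bar{\mathbf{x}}_i)'\mathbf{S}_{\text{pooled}}^{-1}\mathbf{x} - \frac{1}{2}(\bar{\mathbf{x}}_k - \bar{\mathbf{x}}_i)'\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_k + \bar{\mathbf{x}}_i) \ge \ln\left(\frac{p_i}{p_k}\right) \quad \text{for all } i \ne k \tag{11-56}$$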


Given the fixed training set values $\bar{\mathbf{x}}_i$ and $\mathbf{S}_{\text{pooled}}$, $\hat{d}_{ki}(\mathbf{x})$ is a linear function of
the components of $\mathbf{x}$. Therefore, the classification regions defined by (11-56), or
equivalently by (11-52), are also bounded by hyperplanes, as in Figure 11.10.

As with the sample linear discriminant rule of (11-52), if the prior probabilities
are difficult to assess, they are frequently all taken to be equal. In this case,
$\ln(p_i/p_k) = 0$ for all pairs.
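To see that the pairwise form (11-56) leads to the same allocation as the largest linear discriminant score, the sketch below applies it to the quantities of Example 11.10 (mean vectors [-1, 3], [1, 4], [0, -2], inverse pooled covariance (1/35)[[36, 3], [3, 9]], priors .25, .25, .50). The function names are illustrative, and the code is a check on one data set under those inputs rather than a general proof.

```python
import math

# Quantities from Example 11.10.
S_INV = [[36 / 35, 3 / 35],
         [3 / 35, 9 / 35]]
means = [[-1, 3], [1, 4], [0, -2]]
priors = [0.25, 0.25, 0.50]

def d_ki(x, mk, mi, s_inv):
    """Left-hand side of (11-56): (mk - mi)' S^{-1} [x - (mk + mi)/2]."""
    diff = [mk[0] - mi[0], mk[1] - mi[1]]
    a = [diff[0] * s_inv[0][0] + diff[1] * s_inv[1][0],
         diff[0] * s_inv[0][1] + diff[1] * s_inv[1][1]]
    mid = [0.5 * (mk[0] + mi[0]), 0.5 * (mk[1] + mi[1])]
    return a[0] * (x[0] - mid[0]) + a[1] * (x[1] - mid[1])

def allocate(x):
    """Return the 1-based population k satisfying (11-56) for all i != k."""
    g = len(means)
    for k in range(g):
        if all(d_ki(x, means[k], means[i], S_INV) >= math.log(priors[i] / priors[k])
               for i in range(g) if i != k):
            return k + 1
    return None  # not reached when the inequalities are consistent

x0 = [-2, -1]
print(allocate(x0), round(d_ki(x0, means[2], means[0], S_INV), 3))
```

For this observation the comparison of $\pi_3$ against $\pi_1$ gives 0.9, which exceeds ln(.25/.50), and the comparison against $\pi_2$ is larger still, so $\mathbf{x}_0$ falls in $R_3$.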

Because they employ estimates of population parameters, the sample classification
rules (11-48) and (11-52) may no longer be optimal. Their performance,
however, can be evaluated using Lachenbruch’s holdout procedure. If $n_{iM}^{(H)}$ is the
number of misclassified holdout observations in the ith group, $i = 1, 2, \ldots, g$, then
an estimate of the expected actual error rate, E(AER), is provided by

$$\hat{E}(\text{AER}) = \frac{\sum_{i=1}^{g} n_{iM}^{(H)}}{\sum_{i=1}^{g} n_i} \tag{11-57}$$
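As a small worked instance of (11-57), the sketch below uses the cross-validation counts implied by the percentages in Panel 11.1 for the admission data: 5 of 31 admit, 2 of 26 border, and 2 of 28 notadmit observations misclassified. It also computes the prior-weighted average of the per-group rates, which appears to be how the total of 0.1032 in the panel arises; that interpretation of the SAS total is an inference from the numbers, not something stated in the text.

```python
# Holdout (cross-validation) misclassification counts implied by Panel 11.1.
misclassified = {"admit": 5, "border": 2, "notadmit": 2}
sizes = {"admit": 31, "border": 26, "notadmit": 28}

def e_aer(misclassified, sizes):
    """Estimate (11-57): total holdout errors over total sample size."""
    return sum(misclassified.values()) / sum(sizes.values())

def prior_weighted_rate(misclassified, sizes, priors):
    """Average of per-group error rates weighted by prior probabilities."""
    return sum(priors[g] * misclassified[g] / sizes[g] for g in sizes)

rate = e_aer(misclassified, sizes)
equal_priors = {g: 1.0 / len(sizes) for g in sizes}
weighted = prior_weighted_rate(misclassified, sizes, equal_priors)

print(round(rate, 4), round(weighted, 4))
```

The two summaries differ slightly because (11-57) implicitly weights each group by its sample proportion, whereas the weighted version uses the stated priors of 1/3 each.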


Example 11.12 (Effective classification with fewer variables) In his pioneering work
on discriminant functions, Fisher [9] presented an analysis of data collected by
Anderson [1] on three species of iris flowers. (See Table 11.5, Exercise 11.27.)

Let the classes be defined as

$\pi_1$: Iris setosa; $\pi_2$: Iris versicolor; $\pi_3$: Iris virginica

The following four variables were measured from 50 plants of each species.

$X_1$ = sepal length, $X_2$ = sepal width
$X_3$ = petal length, $X_4$ = petal width

Using all the data in Table 11.5, a linear discriminant analysis produced the confusion
matrix

                                             Predicted membership                          Percent
                                $\pi_1$: Setosa    $\pi_2$: Versicolor    $\pi_3$: Virginica    correct
              $\pi_1$: Setosa
Actual        $\pi_2$: Versicolor
membership    $\pi_3$: Virginica



The elements in this matrix were generated using the holdout procedure, so
[see (11-57)]

$$\hat{E}(\text{AER}) = \frac{3}{150} = .02$$

The error rate, 2%, is low.

Often, it is possible to achieve effective classification with fewer variables. It is
good practice to try all the variables one at a time, two at a time, three at a time, and
so forth, to see how well they classify compared to the discriminant function, which
uses all the variables.

If we adopt the holdout estimate of the expected AER as our criterion, we find
for the data on irises:


Single variable        Misclassification rate
$X_1$                      .253
$X_2$                      .480
$X_3$                      .053
$X_4$                      .040

Pairs of variables     Misclassification rate
$X_1, X_2$
$X_1, X_3$
$X_1, X_4$
$X_2, X_3$
$X_2, X_4$
$X_3, X_4$


We see that the single variable $X_4$ = petal width does a very good job of distinguishing
the three species of iris. Moreover, very little is gained by including more
variables. Box plots of $X_4$ = petal width are shown in Figure 11.11 for the three
species of iris. It is clear from the figure that petal width separates the three groups
quite well, with, for example, the petal widths for Iris setosa much smaller than the
petal widths for Iris virginica.

Darroch and Mosimann [6] have suggested that these species of iris may be discriminated
on the basis of "shape" or scale-free information alone. Let $Y_1 = X_1/X_2$
be the sepal shape and $Y_2 = X_3/X_4$ be the petal shape. The use of the variables $Y_1$
and $Y_2$ for discrimination is explored in Exercise 11.28.

The selection of appropriate variables to use in a discriminant analysis is often
difficult. A summary such as the one in this example allows the investigator to make
reasonable and simple choices based on the ultimate criterion of how well the procedure
classifies its target objects. ■
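The variable-screening idea in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two small groups below are hypothetical (they are not the iris data), the function name `holdout_error_rate` is ours, and the classifier is restricted to a single variable with equal priors and a common variance, in which case the linear discriminant rule reduces to assigning an observation to the nearest group mean.

```python
# Hypothetical training data: two groups, two variables per observation.
groups = {
    "A": [(1.0, 5.0), (1.2, 3.0), (0.8, 4.0), (1.1, 6.0)],
    "B": [(2.0, 4.5), (2.2, 5.5), (1.9, 3.5), (2.1, 4.0)],
}

def holdout_error_rate(groups, var):
    """Holdout (leave-one-out) error rate using the single variable `var`.

    Each observation is withheld in turn, the group means are recomputed
    from the remaining observations, and the withheld observation is
    assigned to the group with the nearest mean.
    """
    errors = 0
    total = 0
    for g, rows in groups.items():
        for h in range(len(rows)):
            means = {}
            for k, other in groups.items():
                vals = [r[var] for j, r in enumerate(other)
                        if not (k == g and j == h)]
                means[k] = sum(vals) / len(vals)
            x = rows[h][var]
            predicted = min(means, key=lambda k: abs(x - means[k]))
            errors += predicted != g
            total += 1
    return errors / total

rate_var0 = holdout_error_rate(groups, 0)   # well-separated variable
rate_var1 = holdout_error_rate(groups, 1)   # overlapping variable
print(rate_var0, rate_var1)
```

Comparing the two rates mirrors the table of single-variable misclassification rates: a variable that separates the groups well gives a small holdout error by itself.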


[Figure 11.11 Box plots of petal width for the three species of iris.]

Our discussion has tended to emphasize the linear discriminant rule of (11-52)
or (11-56), and many commercial computer programs are based upon it. Although
the linear discriminant rule has a simple structure, you must remember that it
was derived under the rather strong assumptions of multivariate normality and
equal covariances. Before implementing a linear classification rule, these tentative
assumptions should be checked in the order multivariate normality and then equality
of covariances. If one or both of these assumptions is violated, improved classification
may be possible if the data are first suitably transformed.

The quadratic rules are an alternative to classification with linear discriminant 
functions. They are appropriate if normality appears to hold, but the assumption of 
equal covariance matrices is seriously violated. However, the assumption of normal- 
ity seems to be more critical for quadratic rules than linear rules. If doubt exists as to 
the appropriateness of a linear or quadratic rule, both rules can be constructed and 
their error rates examined using Lachenbruch’s holdout procedure. 


11.6 Fisher's Method for Discriminating 
among Several Populations 


Fisher also proposed an extension of his discriminant method, discussed in 
Section 11.3, to several populations. The motivation behind the Fisher discriminant 
analysis is the need to obtain a reasonable representation of the populations that involves
only a few linear combinations of the observations, such as $\mathbf{a}_1'\mathbf{x}$, $\mathbf{a}_2'\mathbf{x}$, and $\mathbf{a}_3'\mathbf{x}$.
His approach has several advantages when one is interested in separating several 
populations for (1) visual inspection or (2) graphical descriptive purposes. It allows 
for the following: 


1. Convenient representations of the g populations that reduce the dimension
from a very large number of characteristics to a relatively few linear combinations.
Of course, some information, needed for optimal classification, may be
lost, unless the population means lie completely in the lower dimensional space
selected.
2. Plotting of the means of the first two or three linear combinations (discriminants).
This helps display the relationships and possible groupings of the populations.
3. Scatter plots of the sample values of the first two discriminants, which can indicate
outliers or other abnormalities in the data.


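The primary purpose of Fisher's discriminant analysis is to separate populations. It
can, however, also be used to classify, and we shall indicate this use. It is not necessary
to assume that the g populations are multivariate normal. However, we do
assume that the $p \times p$ population covariance matrices are equal and of full rank.
That is, $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_g = \boldsymbol{\Sigma}$. (If $\boldsymbol{\Sigma}$ is not of full rank, we let $\mathbf{P} = [\mathbf{e}_1, \ldots, \mathbf{e}_q]$ be the
eigenvectors of $\boldsymbol{\Sigma}$ corresponding to the nonzero eigenvalues $[\lambda_1, \ldots, \lambda_q]$. Then we replace
$\mathbf{X}$ by $\mathbf{P}'\mathbf{X}$, which has a full rank covariance matrix $\mathbf{P}'\boldsymbol{\Sigma}\mathbf{P}$.)

Let $\bar{\boldsymbol{\mu}}$ denote the mean vector of the combined populations and $\mathbf{B}_{\mu}$ the between
groups sums of cross products, so that

$$\mathbf{B}_{\mu} = \sum_{i=1}^{g} (\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})(\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})' \quad \text{where } \bar{\boldsymbol{\mu}} = \frac{1}{g}\sum_{i=1}^{g} \boldsymbol{\mu}_i \tag{11-58}$$

We consider the linear combination

$$Y = \mathbf{a}'\mathbf{X}$$

which has expected value

$$E(Y) = \mathbf{a}'E(\mathbf{X} \mid \pi_i) = \mathbf{a}'\boldsymbol{\mu}_i \quad \text{for population } \pi_i$$

and variance

$$\text{Var}(Y) = \mathbf{a}'\,\text{Cov}(\mathbf{X})\,\mathbf{a} = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a} \quad \text{for all populations}$$

Consequently, the expected value $\mu_{iY} = \mathbf{a}'\boldsymbol{\mu}_i$ changes as the population from which
$\mathbf{X}$ is selected changes. We first define the overall mean

$$\bar{\mu}_Y = \frac{1}{g}\sum_{i=1}^{g} \mu_{iY} = \frac{1}{g}\sum_{i=1}^{g} \mathbf{a}'\boldsymbol{\mu}_i = \mathbf{a}'\left(\frac{1}{g}\sum_{i=1}^{g} \boldsymbol{\mu}_i\right) = \mathbf{a}'\bar{\boldsymbol{\mu}}$$

and form the ratio

$$\frac{\left(\begin{array}{c}\text{sum of squared distances from}\\ \text{populations to overall mean of } Y\end{array}\right)}{(\text{variance of } Y)} = \frac{\sum_{i=1}^{g} (\mu_{iY} - \bar{\mu}_Y)^2}{\sigma_Y^2} = \frac{\sum_{i=1}^{g} (\mathbf{a}'\boldsymbol{\mu}_i - \mathbf{a}'\bar{\boldsymbol{\mu}})^2}{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}} = \frac{\mathbf{a}'\left(\sum_{i=1}^{g} (\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})(\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})'\right)\mathbf{a}}{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}}$$

or

$$\frac{\sum_{i=1}^{g} (\mu_{iY} - \bar{\mu}_Y)^2}{\sigma_Y^2} = \frac{\mathbf{a}'\mathbf{B}_{\mu}\mathbf{a}}{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}} \tag{11-59}$$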


The ratio in (11-59) measures the variability between the groups of Y-values relative 
to the common variability within groups. We can then select a to maximize this ratio. 

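Ordinarily, $\boldsymbol{\Sigma}$ and the $\boldsymbol{\mu}_i$ are unavailable, but we have a training set consisting of
correctly classified observations. Suppose the training set consists of a random sample
of size $n_i$ from population $\pi_i$, $i = 1, 2, \ldots, g$. Denote the $n_i \times p$ data set, from
population $\pi_i$, by $\mathbf{X}_i$ and its jth row by $\mathbf{x}_{ij}'$. After first constructing the sample mean
vectors

$$\bar{\mathbf{x}}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} \mathbf{x}_{ij}$$

and the covariance matrices $\mathbf{S}_i$, $i = 1, 2, \ldots, g$, we define the "overall average"
vector

$$\bar{\mathbf{x}} = \frac{1}{g}\sum_{i=1}^{g} \bar{\mathbf{x}}_i$$

which is the $p \times 1$ vector average of the individual sample averages.
Next, analogous to $\mathbf{B}_{\mu}$, we define the sample between groups matrix $\mathbf{B}$. Let

$$\mathbf{B} = \sum_{i=1}^{g} (\bar{\mathbf{x}}_i - \bar{\mathbf{x}})(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})' \tag{11-60}$$

Also, an estimate of $\boldsymbol{\Sigma}$ is based on the sample within groups matrix

$$\mathbf{W} = \sum_{i=1}^{g} (n_i - 1)\mathbf{S}_i = \sum_{i=1}^{g}\sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)' \tag{11-61}$$

Consequently, $\mathbf{W}/(n_1 + n_2 + \cdots + n_g - g) = \mathbf{S}_{\text{pooled}}$ is the estimate of $\boldsymbol{\Sigma}$. Before
presenting the sample discriminants, we note that $\mathbf{W}$ is the constant
$(n_1 + n_2 + \cdots + n_g - g)$ times $\mathbf{S}_{\text{pooled}}$, so the same $\hat{\mathbf{a}}$ that maximizes
$\hat{\mathbf{a}}'\mathbf{B}\hat{\mathbf{a}}/\hat{\mathbf{a}}'\mathbf{S}_{\text{pooled}}\hat{\mathbf{a}}$ also maximizes $\hat{\mathbf{a}}'\mathbf{B}\hat{\mathbf{a}}/\hat{\mathbf{a}}'\mathbf{W}\hat{\mathbf{a}}$. Moreover, we can present the optimizing
$\hat{\mathbf{a}}$ in the more customary form as eigenvectors $\hat{\mathbf{e}}_i$ of $\mathbf{W}^{-1}\mathbf{B}$, because if
$\mathbf{W}^{-1}\mathbf{B}\hat{\mathbf{e}} = \hat{\lambda}\hat{\mathbf{e}}$ then $\mathbf{S}_{\text{pooled}}^{-1}\mathbf{B}\hat{\mathbf{e}} = \hat{\lambda}(n_1 + n_2 + \cdots + n_g - g)\hat{\mathbf{e}}$.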


Fisher’s Sample Linear Discriminants 


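Let $\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_s > 0$ denote the $s \le \min(g - 1, p)$ nonzero eigenvalues of
$\mathbf{W}^{-1}\mathbf{B}$ and $\hat{\mathbf{e}}_1, \ldots, \hat{\mathbf{e}}_s$ be the corresponding eigenvectors (scaled so that
$\hat{\mathbf{e}}'\mathbf{S}_{\text{pooled}}\hat{\mathbf{e}} = 1$). Then the vector of coefficients $\hat{\mathbf{a}}$ that maximizes the ratio

$$\frac{\hat{\mathbf{a}}'\mathbf{B}\hat{\mathbf{a}}}{\hat{\mathbf{a}}'\mathbf{W}\hat{\mathbf{a}}} = \frac{\hat{\mathbf{a}}'\left[\sum_{i=1}^{g} (\bar{\mathbf{x}}_i - \bar{\mathbf{x}})(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})'\right]\hat{\mathbf{a}}}{\hat{\mathbf{a}}'\left[\sum_{i=1}^{g}\sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)'\right]\hat{\mathbf{a}}} \tag{11-62}$$

is given by $\hat{\mathbf{a}}_1 = \hat{\mathbf{e}}_1$. The linear combination $\hat{\mathbf{a}}_1'\mathbf{x}$ is called the sample first discriminant.
The choice $\hat{\mathbf{a}}_2 = \hat{\mathbf{e}}_2$ produces the sample second discriminant, $\hat{\mathbf{a}}_2'\mathbf{x}$, and
continuing, we obtain $\hat{\mathbf{a}}_k'\mathbf{x} = \hat{\mathbf{e}}_k'\mathbf{x}$, the sample kth discriminant, $k \le s$.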




Exercise 11.21 outlines the derivation of the Fisher discriminants. The discriminants
will not have zero covariance for each random sample $\mathbf{X}_i$. Rather, the condition

$$\hat{\mathbf{a}}_i'\mathbf{S}_{\text{pooled}}\hat{\mathbf{a}}_k = \begin{cases} 1 & \text{if } i = k \le s \\ 0 & \text{otherwise} \end{cases} \tag{11-63}$$

will be satisfied. The use of $\mathbf{S}_{\text{pooled}}$ is appropriate because we tentatively assumed
that the g population covariance matrices were equal.


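Example 11.13 (Calculating Fisher's sample discriminants for three populations)
Consider the observations on p = 2 variables from g = 3 populations given in
Example 11.10. Assuming that the populations have a common covariance matrix
$\boldsymbol{\Sigma}$, let us obtain the Fisher discriminants. The data are

$$\pi_1\ (n_1 = 3):\ \mathbf{X}_1 = \begin{bmatrix} -2 & 5 \\ 0 & 3 \\ -1 & 1 \end{bmatrix}; \qquad \pi_2\ (n_2 = 3):\ \mathbf{X}_2 = \begin{bmatrix} 0 & 6 \\ 2 & 4 \\ 1 & 2 \end{bmatrix}; \qquad \pi_3\ (n_3 = 3):\ \mathbf{X}_3 = \begin{bmatrix} 1 & -2 \\ 0 & 0 \\ -1 & -4 \end{bmatrix}$$

In Example 11.10, we found that

$$\bar{\mathbf{x}}_1 = \begin{bmatrix} -1 \\ 3 \end{bmatrix}; \quad \bar{\mathbf{x}}_2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix}; \quad \bar{\mathbf{x}}_3 = \begin{bmatrix} 0 \\ -2 \end{bmatrix}; \quad \mathbf{S}_{\text{pooled}} = \begin{bmatrix} 1 & -\tfrac{1}{3} \\ -\tfrac{1}{3} & 4 \end{bmatrix}$$

so,

$$\bar{\mathbf{x}} = \begin{bmatrix} 0 \\ \tfrac{5}{3} \end{bmatrix}; \quad \mathbf{B} = \sum_{i=1}^{3} (\bar{\mathbf{x}}_i - \bar{\mathbf{x}})(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})' = \begin{bmatrix} 2 & 1 \\ 1 & 62/3 \end{bmatrix}$$

$$\mathbf{W} = \sum_{i=1}^{3}\sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)' = (n_1 + n_2 + n_3 - 3)\,\mathbf{S}_{\text{pooled}} = \begin{bmatrix} 6 & -2 \\ -2 & 24 \end{bmatrix}$$

$$\mathbf{W}^{-1} = \frac{1}{140}\begin{bmatrix} 24 & 2 \\ 2 & 6 \end{bmatrix}; \quad \mathbf{W}^{-1}\mathbf{B} = \begin{bmatrix} .3571 & .4667 \\ .0714 & .9000 \end{bmatrix}$$

To solve for the $s \le \min(g - 1, p) = \min(2, 2) = 2$ nonzero eigenvalues of $\mathbf{W}^{-1}\mathbf{B}$,
we must solve

$$|\mathbf{W}^{-1}\mathbf{B} - \lambda\mathbf{I}| = \left|\begin{bmatrix} .3571 - \lambda & .4667 \\ .0714 & .9000 - \lambda \end{bmatrix}\right| = 0$$

or

$$(.3571 - \lambda)(.9000 - \lambda) - (.4667)(.0714) = \lambda^2 - 1.2571\lambda + .2881 = 0$$

Using the quadratic formula, we find that $\hat{\lambda}_1 = .9556$ and $\hat{\lambda}_2 = .3015$. The normalized
eigenvectors $\hat{\mathbf{a}}_1$ and $\hat{\mathbf{a}}_2$ are obtained by solving

$$(\mathbf{W}^{-1}\mathbf{B} - \hat{\lambda}_i\mathbf{I})\hat{\mathbf{a}}_i = \mathbf{0} \qquad i = 1, 2$$

and scaling the results such that $\hat{\mathbf{a}}_i'\mathbf{S}_{\text{pooled}}\hat{\mathbf{a}}_i = 1$. For example, the solution of

$$(\mathbf{W}^{-1}\mathbf{B} - \hat{\lambda}_1\mathbf{I})\hat{\mathbf{a}}_1 = \begin{bmatrix} .3571 - .9556 & .4667 \\ .0714 & .9000 - .9556 \end{bmatrix}\begin{bmatrix} \hat{a}_{11} \\ \hat{a}_{12} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

is, after the normalization $\hat{\mathbf{a}}_1'\mathbf{S}_{\text{pooled}}\hat{\mathbf{a}}_1 = 1$,

$$\hat{\mathbf{a}}_1' = [.386 \quad .495]$$

Similarly,

$$\hat{\mathbf{a}}_2' = [.938 \quad -.112]$$

The two discriminants are

$$\hat{y}_1 = \hat{\mathbf{a}}_1'\mathbf{x} = [.386 \quad .495]\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = .386x_1 + .495x_2$$

$$\hat{y}_2 = \hat{\mathbf{a}}_2'\mathbf{x} = [.938 \quad -.112]\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = .938x_1 - .112x_2$$ ■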
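The calculations in Example 11.13 can be retraced with a short Python sketch that uses only the standard library. The helper names (`mean`, `outer_sum`, `discriminant`) are ours, and the explicit 2 x 2 inverse and characteristic polynomial are shortcuts that apply only because p = 2 here; a general implementation would call a linear algebra library instead.

```python
import math

# Data from Example 11.13: g = 3 groups, p = 2 variables.
groups = [
    [(-2, 5), (0, 3), (-1, 1)],
    [(0, 6), (2, 4), (1, 2)],
    [(1, -2), (0, 0), (-1, -4)],
]

def mean(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]

def outer_sum(vectors):
    """Sum of v v' over a list of 2-vectors, returned as a 2 x 2 list."""
    return [[sum(v[a] * v[b] for v in vectors) for b in range(2)]
            for a in range(2)]

group_means = [mean(rows) for rows in groups]
overall = mean(group_means)                  # average of the group means

# Between groups matrix B (11-60) and within groups matrix W (11-61).
B = outer_sum([[m[k] - overall[k] for k in range(2)] for m in group_means])
W = outer_sum([[r[k] - m[k] for k in range(2)]
               for rows, m in zip(groups, group_means) for r in rows])

n_total = sum(len(rows) for rows in groups)
S_pooled = [[w / (n_total - len(groups)) for w in row] for row in W]

# W^{-1} B via the explicit inverse of a 2 x 2 matrix.
det_W = W[0][0] * W[1][1] - W[0][1] * W[1][0]
W_inv = [[W[1][1] / det_W, -W[0][1] / det_W],
         [-W[1][0] / det_W, W[0][0] / det_W]]
M = [[sum(W_inv[a][k] * B[k][b] for k in range(2)) for b in range(2)]
     for a in range(2)]

# Eigenvalues from lambda^2 - trace*lambda + determinant = 0.
tr = M[0][0] + M[1][1]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
root = math.sqrt(tr * tr - 4 * det)
eigenvalues = [(tr + root) / 2, (tr - root) / 2]

def discriminant(lam):
    """Eigenvector of W^{-1}B for eigenvalue lam, scaled so a' S_pooled a = 1."""
    a = [M[0][1], lam - M[0][0]]   # solves (M00 - lam) a1 + M01 a2 = 0
    q = sum(a[i] * S_pooled[i][j] * a[j] for i in range(2) for j in range(2))
    return [x / math.sqrt(q) for x in a]

a1, a2 = (discriminant(lam) for lam in eigenvalues)
print(eigenvalues, a1, a2)
```

Up to rounding, the printed eigenvalues and coefficient vectors agree with the values .9556, .3015, [.386 .495], and [.938 -.112] reported in the example.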
Example 11.14 (Fisher's discriminants for the crude-oil data) Gerrild and Lantz [13]
collected crude-oil samples from sandstone in the Elk Hills, California, petroleum
reserve. These crude oils can be assigned to one of the three stratigraphic units
(populations)

π1: Wilhelm sandstone
π2: Sub-Mulinia sandstone
π3: Upper sandstone

on the basis of their chemistry. For illustrative purposes, we consider only the five
variables:

$X_1$ = vanadium (in percent ash)
$X_2$ = $\sqrt{\text{iron}}$ (in percent ash)
$X_3$ = $\sqrt{\text{beryllium}}$ (in percent ash)
$X_4$ = 1/[saturated hydrocarbons (in percent area)]
$X_5$ = aromatic hydrocarbons (in percent area)

The first three variables are trace elements, and the last two are determined from
a segment of the curve produced by a gas chromatograph chemical analysis. Table
11.7 (see Exercise 11.30) gives the values of the five original variables (vanadium,
iron, beryllium, saturated hydrocarbons, and aromatic hydrocarbons) for 56 cases
whose population assignment was certain.

A computer calculation yields the summary statistics

Variable    $\bar{\mathbf{x}}_1$    $\bar{\mathbf{x}}_2$    $\bar{\mathbf{x}}_3$    $\bar{\mathbf{x}}$
X1          3.229    4.445    7.226    6.180
X2          6.587    5.667    4.634    5.081
X3           .303     .344     .598     .511
X4           .150     .157     .223     .201
X5                                     6.434

and

$$(n_1 + n_2 + n_3 - 3)\,\mathbf{S}_{\text{pooled}} = (38 + 11 + 7 - 3)\,\mathbf{S}_{\text{pooled}} = \mathbf{W} = \begin{bmatrix} 187.575 & & & & \\ 1.957 & 41.789 & & & \\ -4.031 & 2.128 & 3.580 & & \\ 1.092 & -.143 & -.284 & .077 & \\ 79.672 & -28.243 & 2.559 & -.996 & 338.023 \end{bmatrix}$$

There are at most $s = \min(g - 1, p) = \min(2, 5) = 2$ positive eigenvalues of
$\mathbf{W}^{-1}\mathbf{B}$, and they are 4.354 and .559. The centered Fisher linear discriminants are

$$\hat{y}_1 = .312(x_1 - 6.180) - .710(x_2 - 5.081) + 2.764(x_3 - .511) + 11.809(x_4 - .201) - .235(x_5 - 6.434)$$

$$\hat{y}_2 = .169(x_1 - 6.180) - .245(x_2 - 5.081) - 2.046(x_3 - .511) - 24.453(x_4 - .201) - .378(x_5 - 6.434)$$

The separation of the three group means is fully explained in the two-dimensional
"discriminant space." The group means and the scatter of the individual
observations in the discriminant coordinate system are shown in Figure 11.12. The
separation is quite good. ■


[Figure 11.12 Crude-oil samples in discriminant space. Plotting symbols distinguish the Wilhelm, Sub-Mulinia, and Upper groups and the mean coordinates.]




Example 11.15 (Plotting sports data in two-dimensional discriminant space) Investi- 
gators interested in sports psychology administered the Minnesota Multiphasic 
Personality Inventory (MMPI) to 670 letter winners at the University of Wisconsin 
in Madison. The sports involved and the coefficients in the two discriminant 
functions are given in Table 11.3. 

A plot of the group means using the first two discriminant scores is shown in 
Figure 11.13. Here the separation on the basis of the MMPI scores is not good, 
although a test for the equality of means is significant at the 5% level. (This is due to 
the large sample sizes.) 

While the discriminant coefficients suggest that the first discriminant is most
closely related to the L and Pa scales, and the second discriminant is most closely
associated with the D and Pt scales, we will give the interpretation provided by the
investigators.

The first discriminant, which accounted for 34.4% of the common variance, was
highly correlated with the Mf scale (r = -.78). The second discriminant, which
accounted for an additional 18.3% of the variance, was most highly related to scores
on the Sc, F, and D scales (r's = .66, .54, and .50, respectively). The investigators
suggest that the first discriminant best represents an interest dimension; the second
discriminant reflects psychological adjustment.

Ideally, the standardized discriminant function coefficients should be examined 
to assess the importance of a variable in the presence of other variables. (See [29].) 
Correlation coefficients indicate only how each variable by itself distinguishes the 
groups, ignoring the contributions of the other variables. Unfortunately, in this case, 
the standardized discriminant coefficients were unavailable. 

In general, plots should also be made of other pairs of the first few discriminants.
In addition, scatter plots of the discriminant scores for pairs of discriminants
can be made for each sport. Under the assumption of multivariate normality, the
unit ellipse (circle) centered at the discriminant mean vector $\bar{\mathbf{y}}$ should contain
approximately a proportion

$$P[(\mathbf{Y} - \boldsymbol{\mu}_Y)'(\mathbf{Y} - \boldsymbol{\mu}_Y) \le 1] = P[\chi_2^2 \le 1] = .39$$

of the points when two discriminants are plotted. ■


Table 11.3

Sport          Sample size
Football       158
Basketball      42
Baseball        79
Crew            61
Fencing         50
Golf            28
Gymnastics      26
Hockey          28
Swimming        51
Tennis          31
Track           52
Wrestling

MMPI scale     First discriminant    Second discriminant
QE              .055                 -.098
L              -.194                  .046
F              -.047                 -.099
                                     -.017
Pd              .001                 -.069
Mf             -.074                 -.076
Pa              .189                  .088
Pt              .025                 -.188
Sc             -.046                  .088

Source: W. Morgan and R. W. Johnson.


[Figure 11.13 The discriminant means $\bar{\mathbf{y}}' = [\bar{y}_1, \bar{y}_2]$ for each sport, plotted with the first discriminant on the horizontal axis and the second discriminant on the vertical axis.]
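The value .39 follows from the closed form of the chi-square distribution function with 2 degrees of freedom, $P[\chi_2^2 \le c] = 1 - e^{-c/2}$. A minimal Python check (the function name `chi2_2df_cdf` is ours):

```python
import math

def chi2_2df_cdf(c):
    """P[chi-square with 2 d.f. <= c], valid for c >= 0."""
    return 1.0 - math.exp(-c / 2.0)

# Probability content of the unit circle around the discriminant mean.
p_unit_circle = chi2_2df_cdf(1.0)
print(round(p_unit_circle, 4))
```

The same function gives the content of a circle of any radius r in the two-dimensional discriminant space by evaluating it at c = r squared.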


Using Fisher’s Discriminants to Classify Objects 


Fisher's discriminants were derived for the purpose of obtaining a low-dimensional
representation of the data that separates the populations as much as possible. Although
they were derived from considerations of separation, the discriminants also
provide the basis for a classification rule. We first explain the connection in terms of
the population discriminants $\mathbf{a}_i'\mathbf{X}$.


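Setting

$$Y_k = \mathbf{a}_k'\mathbf{X} = k\text{th discriminant}, \quad k \le s \tag{11-64}$$

we conclude that

$$\mathbf{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_s \end{bmatrix} \text{ has mean vector } \boldsymbol{\mu}_{iY} = \begin{bmatrix} \mu_{iY_1} \\ \vdots \\ \mu_{iY_s} \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1'\boldsymbol{\mu}_i \\ \vdots \\ \mathbf{a}_s'\boldsymbol{\mu}_i \end{bmatrix}$$

under population $\pi_i$ and covariance matrix $\mathbf{I}$, for all populations. (See Exercise 11.21.)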




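Because the components of $\mathbf{Y}$ have unit variances and zero covariances, the
appropriate measure of squared distance from $\mathbf{Y} = \mathbf{y}$ to $\boldsymbol{\mu}_{iY}$ is

$$(\mathbf{y} - \boldsymbol{\mu}_{iY})'(\mathbf{y} - \boldsymbol{\mu}_{iY}) = \sum_{j=1}^{s} (y_j - \mu_{iY_j})^2$$

A reasonable classification rule is one that assigns $\mathbf{y}$ to population $\pi_k$ if the square
of the distance from $\mathbf{y}$ to $\boldsymbol{\mu}_{kY}$ is smaller than the square of the distance from $\mathbf{y}$ to $\boldsymbol{\mu}_{iY}$
for $i \ne k$.

If only r of the discriminants are used for allocation, the rule is

Allocate $\mathbf{x}$ to $\pi_k$ if

$$\sum_{j=1}^{r} (y_j - \mu_{kY_j})^2 = \sum_{j=1}^{r} [\mathbf{a}_j'(\mathbf{x} - \boldsymbol{\mu}_k)]^2 \le \sum_{j=1}^{r} [\mathbf{a}_j'(\mathbf{x} - \boldsymbol{\mu}_i)]^2 \quad \text{for all } i \ne k \tag{11-65}$$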


Before relating this classification procedure to those of Section 11.5, we look 
more closely at the restriction on the number of discriminants. From Exercise 11.21, 


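s = number of discriminants = number of nonzero eigenvalues of $\boldsymbol{\Sigma}^{-1}\mathbf{B}_{\mu}$,
or of $\boldsymbol{\Sigma}^{-1/2}\mathbf{B}_{\mu}\boldsymbol{\Sigma}^{-1/2}$

Now, $\boldsymbol{\Sigma}^{-1}\mathbf{B}_{\mu}$ is $p \times p$, so $s \le p$. Further, the g vectors

$$\boldsymbol{\mu}_1 - \bar{\boldsymbol{\mu}},\ \boldsymbol{\mu}_2 - \bar{\boldsymbol{\mu}},\ \ldots,\ \boldsymbol{\mu}_g - \bar{\boldsymbol{\mu}} \tag{11-66}$$

satisfy $(\boldsymbol{\mu}_1 - \bar{\boldsymbol{\mu}}) + (\boldsymbol{\mu}_2 - \bar{\boldsymbol{\mu}}) + \cdots + (\boldsymbol{\mu}_g - \bar{\boldsymbol{\mu}}) = g\bar{\boldsymbol{\mu}} - g\bar{\boldsymbol{\mu}} = \mathbf{0}$. That is, the first
difference $\boldsymbol{\mu}_1 - \bar{\boldsymbol{\mu}}$ can be written as a linear combination of the last $g - 1$ differences.
Linear combinations of the g vectors in (11-66) determine a hyperplane of dimension
$q \le g - 1$. Taking any vector $\mathbf{e}$ perpendicular to every $\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}}$, and hence
the hyperplane, gives

$$\mathbf{B}_{\mu}\mathbf{e} = \sum_{i=1}^{g} (\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})(\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})'\mathbf{e} = \sum_{i=1}^{g} (\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})\,0 = \mathbf{0}$$

so

$$\boldsymbol{\Sigma}^{-1}\mathbf{B}_{\mu}\mathbf{e} = 0\,\mathbf{e}$$

There are $p - q$ orthogonal eigenvectors corresponding to the zero eigenvalue. This
implies that there are q or fewer nonzero eigenvalues. Since it is always true that
$q \le g - 1$, the number of nonzero eigenvalues s must satisfy $s \le \min(p, g - 1)$.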

Thus, there is no loss of information for discrimination by plotting in two 
dimensions if the following conditions hold. 


Number of variables    Number of populations    Maximum number of discriminants
Any p                  g = 2                    1
Any p                  g = 3                    2
p = 2                  Any g                    2




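We now present an important relation between the classification rule (11-65)
and the "normal theory" discriminant scores [see (11-49)],

$$d_i(\mathbf{x}) = \boldsymbol{\mu}_i'\boldsymbol{\Sigma}^{-1}\mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_i'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln p_i$$

or, equivalently,

$$d_i(\mathbf{x}) - \tfrac{1}{2}\mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} = -\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) + \ln p_i$$

obtained by adding the same constant $-\tfrac{1}{2}\mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x}$ to each $d_i(\mathbf{x})$.


Result 11.6. Let $y_j = \mathbf{a}_j'\mathbf{x}$, where $\mathbf{a}_j = \boldsymbol{\Sigma}^{-1/2}\mathbf{e}_j$ and $\mathbf{e}_j$ is an eigenvector of $\boldsymbol{\Sigma}^{-1/2}\mathbf{B}_{\mu}\boldsymbol{\Sigma}^{-1/2}$.
Then

$$\sum_{j=1}^{p} (y_j - \mu_{iY_j})^2 = \sum_{j=1}^{p} [\mathbf{a}_j'(\mathbf{x} - \boldsymbol{\mu}_i)]^2 = (\mathbf{x} - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) = -2d_i(\mathbf{x}) + \mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} + 2\ln p_i$$

If $\lambda_1 \ge \cdots \ge \lambda_s > 0 = \lambda_{s+1} = \cdots = \lambda_p$, $\sum_{j=s+1}^{p} (y_j - \mu_{iY_j})^2$ is constant for all populations
$i = 1, 2, \ldots, g$ so only the first s discriminants $y_j$, or $\sum_{j=1}^{s} (y_j - \mu_{iY_j})^2$, contribute
to the classification.

Also, if the prior probabilities are such that $p_1 = p_2 = \cdots = p_g = 1/g$, the rule
(11-65) with r = s is equivalent to the population version of the minimum TPM
rule (11-52).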


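Proof. The squared distance $(\mathbf{x} - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) = (\mathbf{x} - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1/2}\boldsymbol{\Sigma}^{-1/2}(\mathbf{x} - \boldsymbol{\mu}_i)$
$= (\mathbf{x} - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1/2}\mathbf{E}\mathbf{E}'\boldsymbol{\Sigma}^{-1/2}(\mathbf{x} - \boldsymbol{\mu}_i)$, where $\mathbf{E} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p]$ is the orthogonal
matrix whose columns are eigenvectors of $\boldsymbol{\Sigma}^{-1/2}\mathbf{B}_{\mu}\boldsymbol{\Sigma}^{-1/2}$. (See Exercise 11.21.)
Since $\boldsymbol{\Sigma}^{-1/2}\mathbf{e}_j = \mathbf{a}_j$ or $\mathbf{a}_j' = \mathbf{e}_j'\boldsymbol{\Sigma}^{-1/2}$,

$$\mathbf{E}'\boldsymbol{\Sigma}^{-1/2}(\mathbf{x} - \boldsymbol{\mu}_i) = \begin{bmatrix} \mathbf{a}_1'(\mathbf{x} - \boldsymbol{\mu}_i) \\ \mathbf{a}_2'(\mathbf{x} - \boldsymbol{\mu}_i) \\ \vdots \\ \mathbf{a}_p'(\mathbf{x} - \boldsymbol{\mu}_i) \end{bmatrix}$$

and

$$(\mathbf{x} - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1/2}\mathbf{E}\mathbf{E}'\boldsymbol{\Sigma}^{-1/2}(\mathbf{x} - \boldsymbol{\mu}_i) = \sum_{j=1}^{p} [\mathbf{a}_j'(\mathbf{x} - \boldsymbol{\mu}_i)]^2$$

Next, each $\mathbf{a}_j = \boldsymbol{\Sigma}^{-1/2}\mathbf{e}_j$, $j > s$, is an (unscaled) eigenvector of $\boldsymbol{\Sigma}^{-1}\mathbf{B}_{\mu}$ with eigenvalue
zero. As shown in the discussion following (11-66), $\mathbf{a}_j$ is perpendicular to every
$\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}}$ and hence to $(\boldsymbol{\mu}_k - \bar{\boldsymbol{\mu}}) - (\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}}) = \boldsymbol{\mu}_k - \boldsymbol{\mu}_i$ for $i, k = 1, 2, \ldots, g$. The
condition $0 = \mathbf{a}_j'(\boldsymbol{\mu}_k - \boldsymbol{\mu}_i) = \mu_{kY_j} - \mu_{iY_j}$ implies that $y_j - \mu_{kY_j} = y_j - \mu_{iY_j}$, so
$\sum_{j=s+1}^{p} (y_j - \mu_{iY_j})^2$ is constant for all $i = 1, 2, \ldots, g$. Therefore, only the first s discriminants
$y_j$ need to be used for classification. ■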


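We now state the classification rule based on the first $r \le s$ sample discriminants.


Fisher's Classification Procedure Based
on Sample Discriminants


Allocate $\mathbf{x}$ to $\pi_k$ if

$$\sum_{j=1}^{r} (\hat{y}_j - \bar{y}_{kj})^2 = \sum_{j=1}^{r} [\hat{\mathbf{a}}_j'(\mathbf{x} - \bar{\mathbf{x}}_k)]^2 \le \sum_{j=1}^{r} [\hat{\mathbf{a}}_j'(\mathbf{x} - \bar{\mathbf{x}}_i)]^2 \quad \text{for all } i \ne k \tag{11-67}$$

where $\hat{\mathbf{a}}_j$ is defined in (11-62), $\bar{y}_{kj} = \hat{\mathbf{a}}_j'\bar{\mathbf{x}}_k$, and $r \le s$.

When the prior probabilities are such that $p_1 = p_2 = \cdots = p_g = 1/g$ and r = s,
rule (11-67) is equivalent to rule (11-52), which is based on the largest linear discriminant
score. In addition, if r < s discriminants are used for classification, there
is a loss of squared distance, or score, of $\sum_{j=r+1}^{p} [\hat{\mathbf{a}}_j'(\mathbf{x} - \bar{\mathbf{x}}_i)]^2$ for each population $\pi_i$,
where $\sum_{j=r+1}^{s} [\hat{\mathbf{a}}_j'(\mathbf{x} - \bar{\mathbf{x}}_i)]^2$ is the part useful for classification.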


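Example 11.16 (Classifying a new observation with Fisher's discriminants) Let us
use the Fisher discriminants

$$\hat{y}_1 = \hat{\mathbf{a}}_1'\mathbf{x} = .386x_1 + .495x_2$$
$$\hat{y}_2 = \hat{\mathbf{a}}_2'\mathbf{x} = .938x_1 - .112x_2$$

from Example 11.13 to classify the new observation $\mathbf{x}_0' = [1 \quad 3]$ in accordance with
(11-67).
Inserting $\mathbf{x}_0' = [x_{01}, x_{02}] = [1 \quad 3]$, we have

$$\hat{y}_1 = .386x_{01} + .495x_{02} = .386(1) + .495(3) = 1.87$$
$$\hat{y}_2 = .938x_{01} - .112x_{02} = .938(1) - .112(3) = .60$$

Moreover, $\bar{y}_{kj} = \hat{\mathbf{a}}_j'\bar{\mathbf{x}}_k$, so that (see Example 11.13)

$$\bar{y}_{11} = \hat{\mathbf{a}}_1'\bar{\mathbf{x}}_1 = [.386 \quad .495]\begin{bmatrix} -1 \\ 3 \end{bmatrix} = 1.10$$

$$\bar{y}_{12} = \hat{\mathbf{a}}_2'\bar{\mathbf{x}}_1 = [.938 \quad -.112]\begin{bmatrix} -1 \\ 3 \end{bmatrix} = -1.27$$

Similarly,

$$\bar{y}_{21} = \hat{\mathbf{a}}_1'\bar{\mathbf{x}}_2 = 2.37, \quad \bar{y}_{22} = \hat{\mathbf{a}}_2'\bar{\mathbf{x}}_2 = .49, \quad \bar{y}_{31} = \hat{\mathbf{a}}_1'\bar{\mathbf{x}}_3 = -.99, \quad \bar{y}_{32} = \hat{\mathbf{a}}_2'\bar{\mathbf{x}}_3 = .22$$

Finally, the smallest value of

$$\sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{kj})^2 = \sum_{j=1}^{2} [\hat{\mathbf{a}}_j'(\mathbf{x} - \bar{\mathbf{x}}_k)]^2$$

for k = 1, 2, 3, must be identified. Using the preceding numbers gives

$$\sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{1j})^2 = (1.87 - 1.10)^2 + (.60 + 1.27)^2 = 4.09$$

$$\sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{2j})^2 = (1.87 - 2.37)^2 + (.60 - .49)^2 = .26$$

$$\sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{3j})^2 = (1.87 + .99)^2 + (.60 - .22)^2 = 8.32$$

Since the minimum of $\sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{kj})^2$ occurs when k = 2, we allocate $\mathbf{x}_0$ to
population $\pi_2$. The situation, in terms of the classifiers $\hat{y}_j$, is illustrated schematically
in Figure 11.14. ■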


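[Figure 11.14 The points $\hat{\mathbf{y}}' = [\hat{y}_1, \hat{y}_2]$, $\bar{\mathbf{y}}_1' = [\bar{y}_{11}, \bar{y}_{12}]$, $\bar{\mathbf{y}}_2' = [\bar{y}_{21}, \bar{y}_{22}]$, and $\bar{\mathbf{y}}_3' = [\bar{y}_{31}, \bar{y}_{32}]$ in the classification plane. The smallest distance is from $\hat{\mathbf{y}}$ to $\bar{\mathbf{y}}_2$.]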




Comment. When two linear discriminant functions are used for classification, 
observations are assigned to populations based on Euclidean distances in the two- 
dimensional discriminant space. 
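The allocation in Example 11.16 amounts to a nearest-mean rule in the discriminant coordinates, which the following Python sketch makes explicit. The coefficient vectors are the rounded values from Example 11.13 and the group mean vectors are computed from that example's data; the function names `scores` and `squared_distances` are ours. Because the sketch does not round intermediate results, its squared distances can differ from the hand calculation in the second decimal place.

```python
# Rounded sample discriminant coefficients from Example 11.13.
a_hat = [(0.386, 0.495), (0.938, -0.112)]

# Group mean vectors for populations 1, 2, 3 (from the Example 11.13 data).
group_means = {1: (-1.0, 3.0), 2: (1.0, 4.0), 3: (0.0, -2.0)}

def scores(x):
    """Discriminant scores [a1'x, a2'x] for a 2-vector x."""
    return [a[0] * x[0] + a[1] * x[1] for a in a_hat]

def squared_distances(x):
    """Squared Euclidean distance from x to each group mean in discriminant space."""
    y = scores(x)
    return {k: sum((yj - mj) ** 2 for yj, mj in zip(y, scores(m)))
            for k, m in group_means.items()}

x0 = (1.0, 3.0)
d = squared_distances(x0)
allocated = min(d, key=d.get)   # rule (11-67): the smallest squared distance wins
print(d, allocated)
```

The new observation is allocated to population 2, as in the example.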

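Up to this point, we have not shown why the first few discriminants are more
important than the last few. Their relative importance becomes apparent from their
contribution to a numerical measure of spread of the populations. Consider the separatory
measure

$$\Delta_S^2 = \sum_{i=1}^{g} (\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}}) \tag{11-68}$$

where

$$\bar{\boldsymbol{\mu}} = \frac{1}{g}\sum_{i=1}^{g} \boldsymbol{\mu}_i$$

and $(\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})$ is the squared statistical distance from the ith
population mean $\boldsymbol{\mu}_i$ to the centroid $\bar{\boldsymbol{\mu}}$. It can be shown (see Exercise 11.22) that
$\Delta_S^2 = \lambda_1 + \lambda_2 + \cdots + \lambda_p$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_s$ are the nonzero eigenvalues
of $\boldsymbol{\Sigma}^{-1}\mathbf{B}_{\mu}$ (or $\boldsymbol{\Sigma}^{-1/2}\mathbf{B}_{\mu}\boldsymbol{\Sigma}^{-1/2}$) and $\lambda_{s+1}, \ldots, \lambda_p$ are the zero eigenvalues.

The separation given by $\Delta_S^2$ can be reproduced in terms of discriminant means.
The first discriminant, $Y_1 = \mathbf{e}_1'\boldsymbol{\Sigma}^{-1/2}\mathbf{X}$, has means $\mu_{iY_1} = \mathbf{e}_1'\boldsymbol{\Sigma}^{-1/2}\boldsymbol{\mu}_i$ and the squared
distance $\sum_{i=1}^{g} (\mu_{iY_1} - \bar{\mu}_{Y_1})^2$ of the $\mu_{iY_1}$'s from the central value $\bar{\mu}_{Y_1} = \mathbf{e}_1'\boldsymbol{\Sigma}^{-1/2}\bar{\boldsymbol{\mu}}$ is $\lambda_1$.
(See Exercise 11.22.) Since $\Delta_S^2$ can also be written as

$$\Delta_S^2 = \lambda_1 + \lambda_2 + \cdots + \lambda_p = \sum_{i=1}^{g} (\boldsymbol{\mu}_{iY} - \bar{\boldsymbol{\mu}}_Y)'(\boldsymbol{\mu}_{iY} - \bar{\boldsymbol{\mu}}_Y) = \sum_{i=1}^{g} (\mu_{iY_1} - \bar{\mu}_{Y_1})^2 + \sum_{i=1}^{g} (\mu_{iY_2} - \bar{\mu}_{Y_2})^2 + \cdots + \sum_{i=1}^{g} (\mu_{iY_p} - \bar{\mu}_{Y_p})^2$$

it follows that the first discriminant makes the largest single contribution, $\lambda_1$, to the
separative measure $\Delta_S^2$. In general, the rth discriminant, $Y_r = \mathbf{e}_r'\boldsymbol{\Sigma}^{-1/2}\mathbf{X}$, contributes
$\lambda_r$ to $\Delta_S^2$. If the next $s - r$ eigenvalues (recall that $\lambda_{s+1} = \lambda_{s+2} = \cdots = \lambda_p = 0$) are
such that $\lambda_{r+1} + \lambda_{r+2} + \cdots + \lambda_s$ is small compared to $\lambda_1 + \lambda_2 + \cdots + \lambda_r$, then the
last discriminants $Y_{r+1}, Y_{r+2}, \ldots, Y_s$ can be neglected without appreciably decreasing
the amount of separation.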
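As a small numerical illustration of this bookkeeping, the share of the total separation carried by each discriminant is its eigenvalue divided by the sum of the eigenvalues. The sketch below is an assumption-laden stand-in: it uses the two sample eigenvalues of $\mathbf{W}^{-1}\mathbf{B}$ reported in Example 11.14 in place of the population eigenvalues discussed here, and the variable names are ours.

```python
# Sample eigenvalues of W^{-1}B reported in Example 11.14 (crude-oil data).
eigenvalues = [4.354, 0.559]

total = sum(eigenvalues)
shares = [lam / total for lam in eigenvalues]

# Running totals: separation retained when only the first r discriminants are kept.
cumulative = []
running = 0.0
for share in shares:
    running += share
    cumulative.append(running)

for r, (share, cum) in enumerate(zip(shares, cumulative), start=1):
    print(f"discriminant {r}: share = {share:.3f}, cumulative = {cum:.3f}")
```

With these values the first discriminant alone carries close to nine-tenths of the total.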

Not much is known about the efficacy of the allocation rule (11-67). Some insight 
is provided by computer-generated sampling experiments, and Lachenbruch [23] 
summarizes its performance in particular cases. The development of the population re- 
sult in (11-65) required a common covariance matrix &. If this is essentially true and 
the samples are reasonably large, rule (11-67) should perform fairly well. In any event, 
its performance can be checked by computing estimated error rates. Specifically, 
Lachenbruch’s estimate of the expected actual error rate given by (11-57) should be 
calculated. 


See [18] for further optimal dimension-reducing properties. 




11.7 Logistic Regression and Classification 


Introduction 


The classification functions already discussed are based on quantitative variables.
Here we discuss an approach to classification where some or all of the variables are
qualitative. This approach is called logistic regression. In its simplest setting, the
response variable Y is restricted to two values. For example, Y may be recorded as
"male" or "female" or "employed" or "not employed."

Even though the response may be a two outcome qualitative variable, we can
always code the two cases as 0 and 1. For instance, we can take male = 0 and
female = 1. Then the probability p of 1 is a parameter of interest. It represents the
proportion in the population who are coded 1. The mean of the distribution of 0's
and 1's is also p since

$$\text{mean} = 0 \times (1 - p) + 1 \times p = p$$

The proportion of 0's is 1 - p, which is sometimes denoted as q.
The variance of the distribution is

$$\text{variance} = 0^2 \times (1 - p) + 1^2 \times p - p^2 = p(1 - p)$$

It is clear the variance is not constant. For p = .5, it equals .5 × .5 = .25 while for
p = .8, it is .8 × .2 = .16. The variance approaches 0 as p approaches either 0 or 1.

Let the response Y be either 0 or 1. If we were to model the probability of 1 with
a single predictor linear model, we would write

$$p = E(Y \mid z) = \beta_0 + \beta_1 z$$

and then add an error term $\varepsilon$. But there are serious drawbacks to this model.


• The predicted values of the response Y could become greater than 1 or less than
0 because the linear expression for its expected value is unbounded.

• One of the assumptions of a regression analysis is that the variance of Y is constant
across all values of the predictor variable Z. We have shown this is not the
case. Of course, weighted least squares might improve the situation.


We need another approach to introduce predictor variables or covariates Z into 
the model (see [26]). Throughout, if the covariates are not fixed by the investigator, 
the approach is to make the models for p(z) conditional on the observed values 
of the covariates Z = z. 


The Logit Model 


Instead of modeling the probability p directly with a linear model, we first consider
the odds ratio

$$\text{odds} = \frac{p}{1 - p}$$

which is the ratio of the probability of 1 to the probability of 0. Note, unlike probability,
the odds ratio can be greater than 1. If a proportion .8 of persons will get
through customs without their luggage being checked, then p = .8 but the odds of
not getting checked is .8/.2 = 4 or 4 to 1 of not being checked. There is a lack of
symmetry here since the odds of being checked are .2/.8 = 1/4. Taking the natural
logarithms, we find that ln(4) = 1.386 and ln(1/4) = -1.386 are exact opposites.

[Figure 11.15 Natural log of odds ratio.]

Consider the natural log function of the odds ratio that is displayed in
Figure 11.15. When the odds x are 1, so outcomes 0 and 1 are equally likely, the natural
log of x is zero. When the odds x are greater than one, the natural log increases
slowly as x increases. However, when the odds x are less than one, the natural log decreases
rapidly as x decreases toward zero.
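The customs illustration can be checked numerically. In this short Python sketch the function names `odds` and `log_odds` are ours; it shows that complementary probabilities have reciprocal odds and log odds of equal size and opposite sign.

```python
import math

def odds(p):
    """Odds of the outcome coded 1, for a probability 0 < p < 1."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds."""
    return math.log(odds(p))

p_not_checked = 0.8
print(odds(p_not_checked))           # about 4, i.e. 4 to 1
print(odds(1 - p_not_checked))       # about 1/4
print(log_odds(p_not_checked))       # about  1.386
print(log_odds(1 - p_not_checked))   # about -1.386
print(log_odds(0.5))                 # equal odds give log odds of zero
```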

In logistic regression for a binary variable, we model the natural log of the odds
ratio, which is called logit(p). Thus

$$\text{logit}(p) = \ln(\text{odds}) = \ln\left(\frac{p}{1 - p}\right) \tag{11-69}$$

The logit is a function of the probability p. In the simplest model, we assume that the
logit graphs as a straight line in the predictor variable Z so

$$\text{logit}(p) = \ln(\text{odds}) = \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 z \tag{11-70}$$

In other words, the log odds are linear in the predictor variable.

Because it is easier for most people to think in terms of probabilities, we can
convert from the logit or log odds to the probability p. By first exponentiating

$$\ln\left(\frac{p(z)}{1 - p(z)}\right) = \beta_0 + \beta_1 z$$

we obtain

$$\frac{p(z)}{1 - p(z)} = \exp(\beta_0 + \beta_1 z)$$

where $e = \exp(1) = 2.718$ is the base of the natural logarithm. Next, solving for $p(z)$,
we obtain

$$p(z) = \frac{\exp(\beta_0 + \beta_1 z)}{1 + \exp(\beta_0 + \beta_1 z)} \tag{11-71}$$

which describes a logistic curve. The relation between p and the predictor z is not linear
but has an S-shaped graph as illustrated in Figure 11.16 for the case $\beta_0 = -1$ and
$\beta_1 = 2$. The value of $\beta_0$ gives the value $\exp(\beta_0)/(1 + \exp(\beta_0))$ for p when z = 0.

[Figure 11.16 Logistic function with $\beta_0 = -1$ and $\beta_1 = 2$.]

The parameter $\beta_1$ in the logistic curve determines how quickly p changes with z
but its interpretation is not as simple as in ordinary linear regression because the relation
is not linear, either in z or $\beta_1$. However, we can exploit the linear relation for
log odds.

To summarize, the logistic curve can be written as

$$p(z) = \frac{\exp(\beta_0 + \beta_1 z)}{1 + \exp(\beta_0 + \beta_1 z)} \quad \text{or} \quad p(z) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 z)}$$
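The two forms of the logistic curve are algebraically identical, which a short Python sketch can confirm numerically for the case $\beta_0 = -1$, $\beta_1 = 2$ shown in Figure 11.16. The function names `logistic` and `logistic_alt` are ours.

```python
import math

def logistic(z, b0, b1):
    """p(z) = exp(b0 + b1 z) / (1 + exp(b0 + b1 z))."""
    t = math.exp(b0 + b1 * z)
    return t / (1.0 + t)

def logistic_alt(z, b0, b1):
    """Equivalent form p(z) = 1 / (1 + exp(-b0 - b1 z))."""
    return 1.0 / (1.0 + math.exp(-b0 - b1 * z))

b0, b1 = -1.0, 2.0
for z in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(z, round(logistic(z, b0, b1), 4), round(logistic_alt(z, b0, b1), 4))
```

At z = 0 the curve takes the value exp(-1)/(1 + exp(-1)), and it crosses p = .5 where the logit is zero, that is, at z = .5 for these parameter values.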


Logistic Regression Analysis 


Consider the model with several predictor variables. Let $(z_{j1}, z_{j2}, \ldots, z_{jr})$ be the values
of the r predictors for the j-th observation. It is customary, as in normal theory
linear regression, to set the first entry equal to 1 and $\mathbf{z}_j = [1, z_{j1}, z_{j2}, \ldots, z_{jr}]'$. Conditional
on these values, we assume that the observation $Y_j$ is Bernoulli with success
probability $p(\mathbf{z}_j)$, depending on the values of the covariates. Then

$$P(Y_j = y_j) = p^{y_j}(\mathbf{z}_j)\,(1 - p(\mathbf{z}_j))^{1 - y_j} \quad \text{for } y_j = 0, 1$$

so

$$E(Y_j) = p(\mathbf{z}_j) \quad \text{and} \quad \text{Var}(Y_j) = p(\mathbf{z}_j)(1 - p(\mathbf{z}_j))$$

It is not the mean that follows a linear model but the natural log of the odds ratio. In
particular, we assume the model

$$\ln\left(\frac{p(\mathbf{z}_j)}{1 - p(\mathbf{z}_j)}\right) = \beta_0 + \beta_1 z_{j1} + \cdots + \beta_r z_{jr} = \boldsymbol{\beta}'\mathbf{z}_j \tag{11-72}$$

where $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_r]'$.


Maximum Likelihood Estimation. Estimates of the $\beta$'s can be obtained by the
method of maximum likelihood. The likelihood L is given by the joint probability
distribution evaluated at the observed counts $y_j$. Hence

$$L(b_0, b_1, \ldots, b_r) = \prod_{j=1}^{n} p^{y_j}(\mathbf{z}_j)\,(1 - p(\mathbf{z}_j))^{1 - y_j} = \frac{\prod_{j=1}^{n} e^{y_j(b_0 + b_1 z_{j1} + \cdots + b_r z_{jr})}}{\prod_{j=1}^{n} \left(1 + e^{b_0 + b_1 z_{j1} + \cdots + b_r z_{jr}}\right)} \tag{11-73}$$

The values of the parameters that maximize the likelihood cannot be expressed
in a nice closed form solution as in the normal theory linear models case. Instead
they must be determined numerically by starting with an initial guess and iterating
to the maximum of the likelihood function. Technically, this procedure is called an
iteratively re-weighted least squares method (see [26]).

We denote the numerically obtained values of the maximum likelihood estimates
by the vector $\hat{\boldsymbol{\beta}}$.


Confidence Intervals for Parameters. When the sample size is large, $\hat{\boldsymbol{\beta}}$ is approximately
normal with mean $\boldsymbol{\beta}$, the prevailing values of the parameters, and approximate
covariance matrix

$$\widehat{\text{Cov}}(\hat{\boldsymbol{\beta}}) \approx \left[\sum_{j=1}^{n} \hat{p}(\mathbf{z}_j)(1 - \hat{p}(\mathbf{z}_j))\,\mathbf{z}_j\mathbf{z}_j'\right]^{-1} \tag{11-74}$$

The square roots of the diagonal elements of this matrix are the large sample estimated
standard deviations or standard errors (SE) of the estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_r$
respectively. The large sample 95% confidence interval for $\beta_k$ is

$$\hat{\beta}_k \pm 1.96\,\text{SE}(\hat{\beta}_k) \qquad k = 0, 1, \ldots, r \tag{11-75}$$

The confidence intervals can be used to judge the significance of the individual
terms in the model for the logit. Large sample confidence intervals for the logit and
for the population proportion $p(\mathbf{z}_j)$ can be constructed as well. See [17] for details.
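The numerical maximization and the large sample interval (11-75) can be sketched for the single predictor model logit(p) = $\beta_0 + \beta_1 z$. The Python code below is only an illustration: the data are hypothetical, the function names are ours, and it uses plain Newton-Raphson steps (which for this model coincide with iteratively re-weighted least squares) with a fixed number of iterations and no safeguards for separable data or slow convergence.

```python
import math

def probabilities(b0, b1, z):
    return [1.0 / (1.0 + math.exp(-(b0 + b1 * zj))) for zj in z]

def score(b0, b1, z, y):
    """Gradient of the log-likelihood: sum of z_j (y_j - p_j), with z_j = [1, z]'."""
    p = probabilities(b0, b1, z)
    g0 = sum(yj - pj for yj, pj in zip(y, p))
    g1 = sum(zj * (yj - pj) for zj, yj, pj in zip(z, y, p))
    return g0, g1

def information(b0, b1, z):
    """Distinct elements of sum of p_j (1 - p_j) z_j z_j', the matrix inverted in (11-74)."""
    w = [pj * (1.0 - pj) for pj in probabilities(b0, b1, z)]
    i00 = sum(w)
    i01 = sum(wj * zj for wj, zj in zip(w, z))
    i11 = sum(wj * zj * zj for wj, zj in zip(w, z))
    return i00, i01, i11

def fit_logistic(z, y, iterations=25):
    b0 = b1 = 0.0                        # initial guess
    for _ in range(iterations):
        g0, g1 = score(b0, b1, z, y)
        i00, i01, i11 = information(b0, b1, z)
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + (information)^{-1} * score
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    i00, i01, i11 = information(b0, b1, z)
    det = i00 * i11 - i01 * i01
    cov = [[i11 / det, -i01 / det], [-i01 / det, i00 / det]]
    return (b0, b1), cov

# Hypothetical data: one predictor z and a 0/1 response y.
z = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

(b0, b1), cov = fit_logistic(z, y)
se_b1 = math.sqrt(cov[1][1])
ci_b1 = (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1)
print(b0, b1, se_b1, ci_b1)
```

At the maximum the score vector is zero, which gives a convenient convergence check. If the interval for $\beta_1$ covers 0, the predictor is not significant at the 5% level by this criterion.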


Likelihood Ratio Tests. For the model with r predictor variables plus the constant,
we denote the maximized likelihood by

$$L_{\max} = L(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_r)$$

If the null hypothesis is $H_0$: $\beta_k = 0$, numerical calculations again give the maximum
likelihood estimate of the reduced model and, in turn, the maximized value of the
likelihood

$$L_{\max,\text{Reduced}} = L(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{k-1}, \hat{\beta}_{k+1}, \ldots, \hat{\beta}_r)$$

When doing logistic regression, it is common to test $H_0$ using minus twice the log-likelihood
ratio

$$-2\ln\left(\frac{L_{\max,\text{Reduced}}}{L_{\max}}\right) \tag{11-76}$$

which, in this context, is called the deviance. It is approximately distributed as chi-square
with 1 degree of freedom when the reduced model has one fewer predictor
variable. $H_0$ is rejected for a large value of the deviance.

An alternative test for the significance of an individual term in the model for the
logit is due to Wald (see [17]). The Wald test of $H_0$: $\beta_k = 0$ uses the test statistic
$Z = \hat{\beta}_k/\text{SE}(\hat{\beta}_k)$ or its chi-square version $Z^2$ with 1 degree of freedom. The likelihood
ratio test is preferable to the Wald test as the level of this test is typically closer
to the nominal $\alpha$.

Generally, if the null hypothesis specifies that a subset of, say, m parameters are simultaneously
0, the deviance is constructed for the implied reduced model and referred
to a chi-squared distribution with m degrees of freedom.
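A deviance calculation can be illustrated without any iteration in one special case: a single binary predictor. With z taking only the values 0 and 1, the two-parameter logit model fits each group's observed proportion exactly, and the reduced model with $\beta_1 = 0$ fits one common proportion. The counts below are hypothetical and the names are ours; the critical value 3.841 is the upper 5% point of the chi-square distribution with 1 degree of freedom.

```python
import math

# Hypothetical counts: (number of 1's, number of observations) at z = 0 and z = 1.
groups = {0: (5, 20), 1: (13, 20)}

def bernoulli_loglik(ones, total, p):
    """Log-likelihood of `ones` successes in `total` Bernoulli trials at probability p."""
    return ones * math.log(p) + (total - ones) * math.log(1.0 - p)

# Full model: fitted probability equals the observed proportion in each group.
loglik_full = sum(bernoulli_loglik(o, n, o / n) for o, n in groups.values())

# Reduced model (beta_1 = 0): a single common proportion.
ones = sum(o for o, _ in groups.values())
total = sum(n for _, n in groups.values())
loglik_reduced = bernoulli_loglik(ones, total, ones / total)

# Deviance (11-76): minus twice the log of the likelihood ratio.
deviance = -2.0 * (loglik_reduced - loglik_full)

CHI2_1DF_UPPER_5PCT = 3.841
reject = deviance > CHI2_1DF_UPPER_5PCT
print(round(deviance, 2), reject)
```

For these counts the deviance exceeds the critical value, so $H_0$: $\beta_1 = 0$ would be rejected at the 5% level.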

When working with individual binary observations $Y_j$, the residuals

$$\frac{y_j - \hat{p}(\mathbf{z}_j)}{\sqrt{\hat{p}(\mathbf{z}_j)(1 - \hat{p}(\mathbf{z}_j))}}$$

each can assume only two possible values and are not particularly useful. It is better
if they can be grouped into reasonable sets and a total residual calculated for each
set. If there are, say, t residuals in each group, sum these residuals and then divide by
$\sqrt{t}$ to help keep the variances compatible.
We give additional details on logistic regression and model checking following
an application to classification.


Classification 


Let the response variable Y be 1 if the observational unit belongs to population 1 
and 0 if it belongs to population 2. (The choice of 1 and 0 for response outcomes is 
arbitrary but convenient. In Example 11.17, we use 1 and 2 as outcomes.) Once a 
logistic regression function has been established, and using training sets for each of 
the two populations, we can proceed to classify. Priors and costs are difficult to 
incorporate into the analysis, so the classification rule becomes 


Assign z to population 1 if the estimated odds ratio is greater 
than 1 or 


P(2) 


a Bo + Bit +--+ Bz) > 
1 — plz) exp(Bo + Biz, B,Z,) > 1 


Logistic Regression and Classification 639 


Equivalently, we have the simple linear discriminant rule 


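Assign \(\mathbf{z}\) to population 1 if the linear discriminant is greater
than 0 or

\[
\ln \frac{\hat{p}(\mathbf{z})}{1 - \hat{p}(\mathbf{z})}
  = \hat{\beta}_0 + \hat{\beta}_1 z_1 + \cdots + \hat{\beta}_r z_r > 0
\tag{11-77}
\]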
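The equivalence of the odds form and the linear form of the rule can be checked numerically. The sketch below is only an illustration: the function `classify` and the coefficient values are hypothetical, and population 1 is taken to be the category whose probability \(\hat{p}(\mathbf{z})\) is modeled.

```python
import math

def classify(z, beta):
    """Apply the logistic classification rule.

    beta[0] is the intercept and beta[1:] multiply z_1, ..., z_r.
    Returns the assigned population, the linear discriminant,
    the estimated odds, and the estimated probability.
    """
    linear = beta[0] + sum(b * v for b, v in zip(beta[1:], z))
    p_hat = 1.0 / (1.0 + math.exp(-linear))
    odds = p_hat / (1.0 - p_hat)
    population = 1 if linear > 0 else 2
    return population, linear, odds, p_hat

# Hypothetical coefficients and covariate vectors, for illustration only.
beta = [-1.0, 0.5, 0.25]
results = [classify(z, beta) for z in [(3.0, 2.0), (1.0, 0.0)]]
for population, linear, odds, p_hat in results:
    print(population, round(linear, 3), round(odds, 3), round(p_hat, 3))
```

In each case the three conditions, linear discriminant greater than 0, odds greater than 1, and \(\hat{p}(\mathbf{z})\) greater than .5, agree.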


Example 11.17 (Logistic regression with the salmon data) We introduced the salmon
data in Example 11.8 (see Table 11.2). In Example 11.8, we ignored the gender of the
salmon when considering the problem of classifying salmon as Alaskan or Canadian
based on growth ring measurements. Perhaps better classification is possible if
gender is included in the analysis. Panel 11.2 contains the SAS output from a logistic
regression analysis of the salmon data. Here the response Y is 1 if Alaskan salmon and
2 if Canadian salmon. The predictor variables (covariates) are gender (1 if female, 2 if
male), freshwater growth and marine growth. From the SAS output under Testing
the Global Null Hypothesis, the likelihood ratio test result (see (11-76), with the
reduced model containing only a \(\beta_0\) term) is significant at the < .0001 level. At least
one covariate is required in the linear model for the logit. Examining the significance
of individual terms under the heading Analysis of Maximum Likelihood Estimates,
we see that the Wald test suggests gender is not significant (p-value = .7356). On the
other hand, freshwater growth and marine growth are significant covariates. Gender can be
dropped from the model. It is not a useful variable for classification. The logistic
regression model can be re-estimated without gender and the resulting function used
to classify the salmon as Alaskan or Canadian using rule (11-77).

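Turning to the classification problem, but retaining gender, we assign salmon j
to population 2, Canadian, if the linear classifier

\[
\hat{\boldsymbol{\beta}}'\mathbf{z} = 3.5054 + .2816\,\text{gender} + .1264\,\text{freshwater} - .0486\,\text{marine} \ge 0
\]

Positive values point to population 2 because the SAS output in Panel 11.2 models
the probability that country = 2.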


The observations that are misclassified are


Row   Pop   Gender   Freshwater   Marine   Linear Classifier

  2    1      1         131         355          3.093
 12    1      2         123         372          1.537
 13    1      1         123         372          1.255
 30    1      2         118         381          0.467
 51    2      1         129         420         -0.319
 68    2      2         136         438         -0.028
 71    2      2          90         385         -3.266
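The tabled values can be reproduced from the fitted coefficients. The following sketch is a check under two stated assumptions: the marine coefficient is negative (this is what reproduces the printed classifier values), and a nonnegative classifier value points to population 2, Canadian, the category modeled in the SAS output. The variable names are ours.

```python
# Fitted coefficients from the example; the sign of the marine term is
# taken to be negative because that reproduces the tabled values.
INTERCEPT, B_GENDER, B_FRESH, B_MARINE = 3.5054, 0.2816, 0.1264, -0.0486

# (row, population, gender, freshwater, marine, tabled classifier value)
misclassified = [
    (2, 1, 1, 131, 355, 3.093),
    (12, 1, 2, 123, 372, 1.537),
    (13, 1, 1, 123, 372, 1.255),
    (30, 1, 2, 118, 381, 0.467),
    (51, 2, 1, 129, 420, -0.319),
    (68, 2, 2, 136, 438, -0.028),
    (71, 2, 2, 90, 385, -3.266),
]

def linear_classifier(gender, freshwater, marine):
    return INTERCEPT + B_GENDER * gender + B_FRESH * freshwater + B_MARINE * marine

computed = []
for row, pop, gender, fresh, marine, tabled in misclassified:
    value = linear_classifier(gender, fresh, marine)
    predicted = 2 if value >= 0 else 1   # assumption: value >= 0 means Canadian
    computed.append((row, pop, predicted, value, tabled))
    print(f"row {row:2d}: actual {pop}, predicted {predicted}, classifier {value:7.3f}")

errors_pop1 = sum(1 for _, pop, *_ in computed if pop == 1)
errors_pop2 = sum(1 for _, pop, *_ in computed if pop == 2)
aper_percent = 100.0 * (errors_pop1 + errors_pop2) / (50 + 50)
print(f"apparent error rate = {aper_percent:.0f}%")
```

With 50 salmon in each training set, the 4 + 3 errors give an apparent error rate of 7%.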


From these misclassifications, the confusion matrix is


                                  Predicted membership

                            π1: Alaskan     π2: Canadian

          π1: Alaskan            46               4
Actual
          π2: Canadian            3              47



and the apparent error rate, expressed as a percentage, is

\[
\text{APER} = \frac{4 + 3}{50 + 50} \times 100 = 7\%
\]

When performing a logistic classification, it would be preferable to have an estimate
of the misclassification probabilities using the jackknife (holdout) approach, but this
is not currently available in the major statistical software packages.

We could have continued the analysis in Example 11.17 by dropping gender and
using just the freshwater and marine growth measurements. However, when normal
distributions with equal covariance matrices prevail, logistic classification can be
quite inefficient compared to the normal theory linear classifier (see [7]).


Logistic Regression with Binomial Responses 


We now consider a slightly more general case where several runs are made at the
same values of the covariates \(\mathbf{z}_j\) and there are a total of m different sets where these
predictor variables are constant. When \(n_j\) independent trials are conducted with
the predictor variables \(\mathbf{z}_j\), the response \(Y_j\) is modeled as a binomial distribution
with probability \(p(\mathbf{z}_j) = P(\text{Success} \mid \mathbf{z}_j)\).

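Because the \(Y_j\) are assumed to be independent, the likelihood is the product

\[
L(\beta_0, \beta_1, \ldots, \beta_r)
  = \prod_{j=1}^{m} \binom{n_j}{y_j}\, p^{y_j}(\mathbf{z}_j)\bigl(1 - p(\mathbf{z}_j)\bigr)^{n_j - y_j}
\tag{11-78}
\]

where the probabilities \(p(\mathbf{z}_j)\) follow the logit model (11-72).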


PANEL 11.2 SAS ANALYSIS FOR SALMON DATA USING PROC LOGISTIC.


PROGRAM COMMANDS

title 'Logistic Regression and Discrimination';
data salmon;
[infile statement illegible]
input country gender freshwater marine;
proc logistic desc;
model country = gender freshwater marine / expb;


OUTPUT

Logistic Regression and Discrimination
The LOGISTIC Procedure

Model Information

Model        binary logit

Response Profile

Ordered                   Total
Value       country       Frequency
1           2             50
2           1             50

Probability modeled is country = 2.

Model Fit Statistics

                             Intercept
              Intercept      and
Criterion     Only           Covariates
AIC           140.629        46.674
SC            143.235        57.094
-2 Log L      138.629        38.674

Testing Global Null Hypothesis: BETA = 0

Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    99.9557       3     <.0001
Wald                19.4435       3     0.0002

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates


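The maximum likelihood estimates \(\hat{\boldsymbol{\beta}}\) must be obtained numerically because
there is no closed form expression for their computation. When the total sample size
is large, the approximate covariance matrix \(\widehat{\mathrm{Cov}}(\hat{\boldsymbol{\beta}})\) is

\[
\widehat{\mathrm{Cov}}(\hat{\boldsymbol{\beta}}) \approx
  \left[ \sum_{j=1}^{m} n_j\, \hat{p}(\mathbf{z}_j)\bigl(1 - \hat{p}(\mathbf{z}_j)\bigr)\, \mathbf{z}_j \mathbf{z}_j' \right]^{-1}
\tag{11-79}
\]

and the i-th diagonal element is an estimate of the variance of the i-th component
of \(\hat{\boldsymbol{\beta}}\). Its square root is an estimate of the large sample standard error of that
component.

It can also be shown that a large sample estimate of the variance of the
probability \(\hat{p}(\mathbf{z}_k)\) is given by

\[
\widehat{\mathrm{Var}}\bigl(\hat{p}(\mathbf{z}_k)\bigr) \approx
  \bigl(\hat{p}(\mathbf{z}_k)(1 - \hat{p}(\mathbf{z}_k))\bigr)^{2}\,
  \mathbf{z}_k' \left[ \sum_{j=1}^{m} n_j\, \hat{p}(\mathbf{z}_j)\bigl(1 - \hat{p}(\mathbf{z}_j)\bigr)\, \mathbf{z}_j \mathbf{z}_j' \right]^{-1} \mathbf{z}_k
\]

Consideration of the interval plus and minus two estimated standard deviations
from \(\hat{p}(\mathbf{z}_j)\) may suggest observations that are difficult to classify.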
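The numerical maximization and the two large sample variance formulas can be sketched as follows. This is a minimal Newton-Raphson iteration under our own assumptions, not the algorithm of any particular package: the function name `fit_binomial_logit`, the fixed iteration count, and the small dose-response data set are all hypothetical, and each row of `Z` is \(\mathbf{z}_j'\) with a leading 1 for the constant term.

```python
import numpy as np

def fit_binomial_logit(Z, y, n, n_iter=25):
    """Newton-Raphson for the binomial logit likelihood (11-78).

    Returns the estimate of beta, the covariance estimate (11-79),
    and the fitted probabilities.
    """
    Z, y, n = (np.asarray(a, dtype=float) for a in (Z, y, n))
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        w = n * p * (1.0 - p)                  # n_j p_j (1 - p_j)
        info = Z.T @ (w[:, None] * Z)          # sum of w_j z_j z_j'
        score = Z.T @ (y - n * p)              # gradient of the log-likelihood
        beta = beta + np.linalg.solve(info, score)
    p = 1.0 / (1.0 + np.exp(-Z @ beta))
    w = n * p * (1.0 - p)
    cov = np.linalg.inv(Z.T @ (w[:, None] * Z))
    return beta, cov, p

# Hypothetical data: successes y_j out of n_j = 20 trials at five doses.
dose = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Z = np.column_stack([np.ones_like(dose), dose])
n = np.full(5, 20.0)
y = np.array([2.0, 5.0, 10.0, 15.0, 18.0])

beta_hat, cov_hat, p_hat = fit_binomial_logit(Z, y, n)
se = np.sqrt(np.diag(cov_hat))                 # large sample standard errors
# Variance of each fitted probability: (p(1-p))^2 z_k' Cov z_k.
var_p = (p_hat * (1.0 - p_hat)) ** 2 * np.einsum("ij,jk,ik->i", Z, cov_hat, Z)
lower = p_hat - 2.0 * np.sqrt(var_p)
upper = p_hat + 2.0 * np.sqrt(var_p)
print(beta_hat, se)
print(np.column_stack([lower, p_hat, upper]))
```

Cases whose interval straddles .5 are the ones that may be difficult to classify.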




Model Checking. Once any model is fit to the data, it is good practice to investigate
the adequacy of the fit. The following questions must be addressed.


• Is there any systematic departure from the fitted logistic model?

• Are there any observations that are unusual in that they don't fit the overall
pattern of the data (outliers)?

• Are there any observations that lead to important changes in the statistical
analysis when they are included or excluded (high influence)?


If there is no parametric structure to the single trial probabilities \(p(\mathbf{z}_j) =
P(\text{Success} \mid \mathbf{z}_j)\), each would be estimated using the observed number of successes
(1's) \(y_j\) in \(n_j\) trials. Under this nonparametric model, or saturated model, the
contribution to the likelihood for the j-th case is

\[
\binom{n_j}{y_j}\, p^{y_j}(\mathbf{z}_j)\bigl(1 - p(\mathbf{z}_j)\bigr)^{n_j - y_j}
\]

which is maximized by the choices \(\hat{p}(\mathbf{z}_j) = y_j/n_j\) for \(j = 1, 2, \ldots, m\). Here
\(n = \sum_{j=1}^{m} n_j\) is the total number of trials. The resulting value for minus twice the
maximized nonparametric (NP) log-likelihood is

\[
-2 \ln L_{\max,\mathrm{NP}} =
  -2 \sum_{j=1}^{m} \left[ y_j \ln\!\left(\frac{y_j}{n_j}\right)
     + (n_j - y_j) \ln\!\left(1 - \frac{y_j}{n_j}\right) \right]
  - 2 \ln\!\left( \prod_{j=1}^{m} \binom{n_j}{y_j} \right)
\tag{11-80}
\]

The last term on the right hand side of (11-80) is common to all models.
We also define a deviance between the nonparametric model and a fitted model
having a constant and r - 1 predictors as minus twice the log-likelihood ratio or

\[
G^2 = 2 \sum_{j=1}^{m} \left[ y_j \ln\!\left(\frac{y_j}{\hat{y}_j}\right)
     + (n_j - y_j) \ln\!\left(\frac{n_j - y_j}{n_j - \hat{y}_j}\right) \right]
\tag{11-81}
\]

where \(\hat{y}_j = n_j \hat{p}(\mathbf{z}_j)\) is the fitted number of successes. This is the specific deviance
quantity that plays a role similar to that played by the residual (error) sum of
squares in the linear models setting.

For large sample sizes, \(G^2\) has approximately a chi square distribution with
degrees of freedom equal to the number of observations, m, minus the number of
parameters \(\beta\) estimated.

Notice the deviance for the full model, \(G^2_{\mathrm{Full}}\), and the deviance for a reduced
model, \(G^2_{\mathrm{Reduced}}\), lead to a contribution for the extra predictor terms

\[
G^2_{\mathrm{Reduced}} - G^2_{\mathrm{Full}} = -2 \ln\!\left( \frac{L_{\max,\mathrm{Reduced}}}{L_{\max}} \right)
\tag{11-82}
\]

This difference is approximately \(\chi^2\) with degrees of freedom \(df = df_{\mathrm{Reduced}} - df_{\mathrm{Full}}\).
A large value for the difference implies the full model is required.

When m is large, there are too many probabilities to estimate under the
nonparametric model and the chi-square approximation cannot be established by
existing methods of proof. It is better to rely on likelihood ratio tests of logistic models
where a few terms are dropped.
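To make the deviance concrete, the sketch below evaluates \(G^2\) from (11-81) for a constant-only model and confirms that it equals minus twice the log-likelihood ratio against the saturated model. The grouped data are hypothetical and the function names are ours. The last lines apply the difference in (11-82) to the -2 Log L values reported in Panel 11.2.

```python
import math

def binom_loglik(y, n, p):
    """Binomial log-likelihood, the logarithm of (11-78)."""
    total = 0.0
    for yj, nj, pj in zip(y, n, p):
        total += math.log(math.comb(nj, yj))
        if yj > 0:
            total += yj * math.log(pj)
        if yj < nj:
            total += (nj - yj) * math.log(1.0 - pj)
    return total

def deviance(y, n, p):
    """G^2 of (11-81); terms with y_j = 0 or y_j = n_j contribute 0."""
    g2 = 0.0
    for yj, nj, pj in zip(y, n, p):
        fitted = nj * pj
        if yj > 0:
            g2 += yj * math.log(yj / fitted)
        if yj < nj:
            g2 += (nj - yj) * math.log((nj - yj) / (nj - fitted))
    return 2.0 * g2

# Hypothetical grouped data: y_j successes in n_j trials, m = 5 sets.
y = [2, 5, 10, 15, 18]
n = [20, 20, 20, 20, 20]

p_saturated = [yj / nj for yj, nj in zip(y, n)]
p_constant = [sum(y) / sum(n)] * len(y)      # MLE when only a constant is fit

g2_constant = deviance(y, n, p_constant)
lr_check = -2.0 * (binom_loglik(y, n, p_constant) - binom_loglik(y, n, p_saturated))
print(round(g2_constant, 4), round(lr_check, 4))

# Difference in -2 Log L from Panel 11.2: constant only versus three covariates.
lr_salmon = 138.629 - 38.674
print(round(lr_salmon, 3), "on 3 degrees of freedom")
```

The binomial coefficient term cancels in the ratio, which is why it does not appear in (11-81).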




Residuals and Goodness-of-Fit Tests. Residuals can be inspected for patterns that
suggest lack of fit of the logit model form and the choice of predictor variables
(covariates). In logistic regression, residuals are not as well defined as in the multiple
regression models discussed in Chapter 7. Three different definitions of residuals are
available.


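Deviance residuals (\(d_j\)):

\[
d_j = \pm \sqrt{ 2 \left[ y_j \ln\!\left( \frac{y_j}{n_j \hat{p}(\mathbf{z}_j)} \right)
     + (n_j - y_j) \ln\!\left( \frac{n_j - y_j}{n_j (1 - \hat{p}(\mathbf{z}_j))} \right) \right] }
\]

where the sign of \(d_j\) is the same as that of \(y_j - n_j \hat{p}(\mathbf{z}_j)\) and,

\[
\begin{aligned}
\text{if } y_j = 0, &\text{ then } d_j = -\sqrt{2 n_j \,\bigl|\ln\bigl(1 - \hat{p}(\mathbf{z}_j)\bigr)\bigr|} \\
\text{if } y_j = n_j, &\text{ then } d_j = \sqrt{2 n_j \,\bigl|\ln \hat{p}(\mathbf{z}_j)\bigr|}
\end{aligned}
\tag{11-83}
\]

Pearson residuals (\(r_j\)):

\[
r_j = \frac{y_j - n_j \hat{p}(\mathbf{z}_j)}{\sqrt{n_j \hat{p}(\mathbf{z}_j)\bigl(1 - \hat{p}(\mathbf{z}_j)\bigr)}}
\tag{11-84}
\]

Standardized Pearson residuals (\(r_{sj}\)):

\[
r_{sj} = \frac{r_j}{\sqrt{1 - h_{jj}}}
\tag{11-85}
\]

where \(h_{jj}\) is the (j, j)th element in the "hat" matrix H given by equation (11-87).
Values larger than about 2.5 suggest lack of fit at the particular \(\mathbf{z}_j\).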

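An overall test of goodness of fit, preferred especially for smaller sample
sizes, is provided by Pearson's chi square statistic

\[
X^2 = \sum_{j=1}^{m} r_j^2
    = \sum_{j=1}^{m} \frac{\bigl(y_j - n_j \hat{p}(\mathbf{z}_j)\bigr)^2}{n_j \hat{p}(\mathbf{z}_j)\bigl(1 - \hat{p}(\mathbf{z}_j)\bigr)}
\tag{11-86}
\]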


Notice that the chi square statistic, a single number summary of fit, is the sum of the 
squares of the Pearson residuals. Inspecting the Pearson residuals themselves allows 
us to examine the quality of fit over the entire pattern of covariates. 

Another goodness-of-fit test due to Hosmer and Lemeshow [17] is only
applicable when the proportion of observations with tied covariate patterns is small and
all the predictor variables (covariates) are continuous.
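The Pearson and deviance residuals, and Pearson's chi square statistic, can be computed directly from their definitions. In the sketch below the counts and the fitted probabilities are hypothetical stand-ins for the output of a fitted model, and the function names are ours; the boundary cases \(y_j = 0\) and \(y_j = n_j\) are handled separately, with the sign chosen to match that of \(y_j - n_j \hat{p}(\mathbf{z}_j)\).

```python
import math

def pearson_residual(y, n, p):
    """r_j of (11-84)."""
    return (y - n * p) / math.sqrt(n * p * (1.0 - p))

def deviance_residual(y, n, p):
    """d_j of (11-83), signed like y - n p."""
    if y == 0:
        return -math.sqrt(2.0 * n * abs(math.log(1.0 - p)))
    if y == n:
        return math.sqrt(2.0 * n * abs(math.log(p)))
    inner = y * math.log(y / (n * p)) + (n - y) * math.log((n - y) / (n * (1.0 - p)))
    return math.copysign(math.sqrt(2.0 * max(inner, 0.0)), y - n * p)

# Hypothetical counts and fitted probabilities, for illustration only.
y = [2, 5, 10, 15, 20]
n = [20, 20, 20, 20, 20]
p_fit = [0.12, 0.27, 0.50, 0.73, 0.95]

pearson = [pearson_residual(a, b, c) for a, b, c in zip(y, n, p_fit)]
dev = [deviance_residual(a, b, c) for a, b, c in zip(y, n, p_fit)]
x2 = sum(r * r for r in pearson)     # Pearson chi square statistic (11-86)
g2 = sum(d * d for d in dev)         # squared deviance residuals sum to the deviance

for r, d in zip(pearson, dev):
    print(f"Pearson {r:7.3f}   deviance {d:7.3f}")
print(f"X^2 = {x2:.3f}, G^2 = {g2:.3f}")
```

When the fit is reasonable the two kinds of residuals are close to one another, as are \(X^2\) and \(G^2\).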


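Leverage Points and Influential Observations. The logistic regression equivalent of
the hat matrix H contains the estimated probabilities \(\hat{p}(\mathbf{z}_j)\). The logistic regression
version of leverages are the diagonal elements \(h_{jj}\) of this hat matrix.

\[
\mathbf{H} = \mathbf{V}^{-1/2}\mathbf{Z}\bigl(\mathbf{Z}'\mathbf{V}^{-1}\mathbf{Z}\bigr)^{-1}\mathbf{Z}'\mathbf{V}^{-1/2}
\tag{11-87}
\]

where \(\mathbf{V}^{-1}\) is the diagonal matrix with (j, j) element \(n_j \hat{p}(\mathbf{z}_j)(1 - \hat{p}(\mathbf{z}_j))\), and \(\mathbf{V}^{-1/2}\) is the
diagonal matrix with (j, j) element \(\sqrt{n_j \hat{p}(\mathbf{z}_j)(1 - \hat{p}(\mathbf{z}_j))}\).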

Besides the leverages given in (11-87), other measures are available. We
describe the most common, called the delta beta or deletion displacement. It helps
identify observations that, by themselves, have a strong influence on the regression
estimates. This change in regression coefficients, when all observations with the
same covariate values as the j-th case \(\mathbf{z}_j\) are deleted, is quantified as

\[
\Delta\hat{\beta}_j = \frac{r_{sj}^2\, h_{jj}}{1 - h_{jj}}
\tag{11-88}
\]

A plot of \(\Delta\hat{\beta}_j\) versus j can be inspected for influential cases.
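The leverages, standardized Pearson residuals, and delta beta values can be assembled from the hat matrix as in the following sketch. The function name `leverage_diagnostics`, the design matrix, and the fitted probabilities are hypothetical; in practice the probabilities would come from the fitted logit model.

```python
import numpy as np

def leverage_diagnostics(Z, y, n, p_hat):
    """Leverages h_jj from (11-87), residuals (11-84), (11-85), and delta beta (11-88)."""
    Z, y, n, p_hat = (np.asarray(a, dtype=float) for a in (Z, y, n, p_hat))
    w = n * p_hat * (1.0 - p_hat)              # diagonal of V^{-1}
    A = np.sqrt(w)[:, None] * Z                # V^{-1/2} Z
    H = A @ np.linalg.inv(Z.T @ (w[:, None] * Z)) @ A.T
    h = np.diag(H)                             # leverages
    r = (y - n * p_hat) / np.sqrt(w)           # Pearson residuals
    r_s = r / np.sqrt(1.0 - h)                 # standardized Pearson residuals
    delta_beta = r_s ** 2 * h / (1.0 - h)      # deletion displacement
    return h, r, r_s, delta_beta

# Hypothetical design (constant plus one covariate) and fitted probabilities.
dose = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Z = np.column_stack([np.ones_like(dose), dose])
n = np.full(5, 20.0)
y = np.array([2.0, 5.0, 10.0, 15.0, 20.0])
p_hat = np.array([0.12, 0.27, 0.50, 0.73, 0.95])

h, r, r_s, delta_beta = leverage_diagnostics(Z, y, n, p_hat)
for j in range(len(h)):
    print(f"j={j + 1}: h={h[j]:.3f}  r_s={r_s[j]:7.3f}  delta beta={delta_beta[j]:.3f}")
```

Because H is a projection matrix, the leverages sum to the number of columns of Z, which gives a quick check on the computation.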


11.8 Final Comments 


Including Qualitative Variables 


Our discussion in this chapter assumes that the discriminatory or classificatory
variables, \(X_1, X_2, \ldots, X_p\), have natural units of measurement. That is, each variable can,
in principle, assume any real number, and these numbers can be recorded. Often, a
qualitative or categorical variable may be a useful discriminator (classifier). For
example, the presence or absence of a characteristic such as the color red may be a
worthwhile classifier. This situation is frequently handled by creating a variable X
whose numerical value is 1 if the object possesses the characteristic and zero if the
object does not possess the characteristic. The variable is then treated like the
measured variables in the usual discrimination and classification procedures.

Except for logistic classification, there is very little theory available to handle the 
case in which some variables are continuous and some qualitative. Computer simula- 
tion experiments (see [22]) indicate that Fisher’s linear discriminant function can per- 
form poorly or satisfactorily, depending upon the correlations between the qualitative 
and continuous variables. As Krzanowski [22] notes, “A low correlation in one popu- 
lation but a high correlation in the other, or a change in the sign of the correlations be- 
tween the two populations could indicate conditions unfavorable to Fisher’s linear 
discriminant function.” This is a troublesome area and one that needs further study. 


Classification Trees 


An approach to classification completely different from the methods discussed in 
the previous sections of this chapter has been developed. (See [5].) It is very com- 
puter intensive and its implementation is only now becoming widespread. The new 
approach, called classification and regression trees (CART), is closely related to di- 
visive clustering techniques. (See Chapter 12.) 

Initially, all objects are considered as a single group. The group is split into two 
subgroups using, say, high values of a variable for one group and low values for the 
other. The two subgroups are then each split using the values of a second variable. 
The splitting process continues until a suitable stopping point is reached. The values 
of the splitting variables can be ordered or unordered categories. It is this feature 
that makes the CART procedure so general. 

For example, suppose subjects are to be classified as


\(\pi_1\): heart-attack prone
\(\pi_2\): not heart-attack prone


on the basis of age, weight, and exercise activity. In this case, the CART procedure
can be diagrammed as the tree shown in Figure 11.17. The branches of the tree actually
correspond to divisions in the sample space. The region Rj, defined as being over 45,
being overweight, and undertaking no regular exercise, could be used to classify a
subject as \(\pi_1\): heart-attack prone. The CART procedure would try splitting on
different ages, as well as first splitting on weight or on the amount of exercise.

[Figure 11.17 A classification tree. Splits shown include "Overweight" and "Exercise
regularly"; terminal nodes are labeled \(\pi_1\): heart-attack prone or \(\pi_2\): not heart-attack prone.]

The classification tree that results from using the CART methodology with the
Iris data (see Table 11.5), and variables \(X_3\) = petal length (PetLength) and
\(X_4\) = petal width (PetWidth), is shown in Figure 11.18. The binary splitting rules are
indicated in the figure. For example, the first split occurs at petal length = 2.45.
Flowers with petal lengths ≤ 2.45 form one group (left), and those with petal
lengths > 2.45 form the other group (right).


[Figure 11.18 A classification tree for the Iris data. Node 1 splits on
PetLength ≤ 2.45 (N = 150); Node 2 splits on PetWidth ≤ 1.75.]




The next split occurs with the right-hand side group (petal length > 2.45) at
petal width = 1.75. Flowers with petal widths ≤ 1.75 are put in one group (left),
and those with petal widths > 1.75 form the other group (right). The process
continues until there is no gain with additional splitting. In this case, the process stops
with four terminal nodes (TN).

The binary splits form terminal node rectangles (regions) in the positive
quadrant of the \(X_3\), \(X_4\) sample space as shown in Figure 11.19. For example, TN #2
contains those flowers with 2.45 < petal lengths ≤ 4.95 and petal widths ≤ 1.75,
essentially the Iris Versicolor group.

Since the majority of the flowers in, for example, TN #3 are species Virginica, a
new item in this group would be classified as Virginica. That is, TN #3 and TN #4 are
both assigned to the Virginica population. We see that CART has correctly classified
50 of 50 of the Setosa flowers, 47 of 50 of the Versicolor flowers, and 49 of 50 of the
Virginica flowers. The APER = 4/150 = .027. This result is comparable to the result
obtained for the linear discriminant analysis using variables \(X_3\) and \(X_4\) discussed in
Example 11.12.
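The fitted tree amounts to a short sequence of threshold comparisons. The sketch below encodes the splits described above. The function name is ours, and the numbering of terminal nodes 1, 3, and 4 is an assumption: the text only identifies TN #2 explicitly and states that TN #3 and TN #4 are both assigned to Virginica.

```python
def classify_iris(pet_length, pet_width):
    """Return (species, terminal node) from the splits quoted in the text."""
    if pet_length <= 2.45:
        return "Setosa", 1            # assumed to be TN #1
    if pet_width <= 1.75:
        if pet_length <= 4.95:
            return "Versicolor", 2    # TN #2 as described in the text
        return "Virginica", 3         # assumed to be TN #3
    return "Virginica", 4             # assumed to be TN #4

# Illustrative measurements (petal length, petal width), not rows of Table 11.5.
examples = [(1.4, 0.2), (4.5, 1.5), (5.1, 1.6), (5.5, 2.1)]
for length, width in examples:
    print((length, width), classify_iris(length, width))

# Apparent error rate from the counts reported in the text.
correct = 50 + 47 + 49
aper = (150 - correct) / 150
print(round(aper, 3))
```

A new flower is classified by dropping it down the tree until it reaches a terminal node.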

The CART methodology is not tied to an underlying population probability 
distribution of characteristics. Nor is it tied to a particular optimality criterion. In 
practice, the procedure requires hundreds of objects and, often, many variables. 
The resulting tree is very complicated. Subjective judgments must be used to 
prune the tree so that it ends with groups of several objects rather than all 
single objects. Each terminal group is then assigned to the population holding the ma- 
jority membership. A new object can then be classified according to its ultimate group. 

Breiman, Friedman, Olshen, and Stone [5] have developed special-purpose
software for implementing a CART analysis. Also, Loh (see [21] and [25]) has
developed improved classification tree software called QUEST¹³ and CRUISE¹⁴.
Their programs use several intelligent rules for splitting and usually produce a
tree that separates groups well. CART has been very successful in data
mining applications (see Supplement 12A).

[Figure 11.19 Classification tree terminal nodes (regions) in the petal width, petal
length sample space. Horizontal axis: PetWidth (0.0 to 2.5); vertical axis: PetLength.
Plotting symbols: 1 Setosa, 2 Versicolor, 3 Virginica.]


13 Available for download at www.stat.wisc.edu/~loh/quest.html 
14 Available for download at www.stat.wisc.edu/~loh/cruise.html 




Neural Networks 


A neural network (NN) is a computer-intensive, algorithmic procedure for 
transforming inputs into desired outputs using highly connected networks of 
relatively simple processing units (neurons or nodes). Neural networks are modeled 
after the neural activity in the human brain. The three essential features, then, of an 
NN are the basic computing units (neurons or nodes), the network architecture 
describing the connections between the computing units, and the training 
algorithm used to find values of the network parameters (weights) for performing a 
particular task. 

The computing units are connected to one another in the sense that the out- 
put from one unit can serve as part of the input to another unit. Each computing 
unit transforms an input to an output using some prespecified function that is 
typically monotone, but otherwise arbitrary. This function depends on constants 
(parameters) whose values must be determined with a training set of inputs and 
outputs. 

Network architecture is the organization of computing units and the types of 
connections permitted. In statistical applications, the computing units are arranged 
in a series of layers with connections between nodes in different layers, but not be- 
tween nodes in the same layer. The layer receiving the initial inputs is called the 
input layer. The final layer is called the output layer. Any layers between the input 
and output layers are called hidden layers. A simple schematic representation of a 
multilayer NN is shown in Figure 11.20. 


[Figure 11.20 A neural network with one hidden layer. Layers, from bottom to top:
input (the X variables), middle (hidden), and output.]
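The layered architecture can be illustrated with a single forward pass. This is a minimal sketch under our own assumptions: the logistic function is used as the monotone unit function, the output scores are converted to group probabilities with a softmax, and the weights are random rather than trained. None of these choices comes from the text.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """One pass through a network with a single hidden layer."""
    hidden = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # monotone unit function (logistic)
    scores = W2 @ hidden + b2                        # one score per group
    exp_scores = np.exp(scores - scores.max())       # softmax, shifted for stability
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
p, n_hidden, g = 4, 3, 2            # inputs, hidden units, groups (hypothetical sizes)
W1 = rng.normal(size=(n_hidden, p))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(g, n_hidden))
b2 = rng.normal(size=g)

x = np.array([5.1, 3.5, 1.4, 0.2])  # one hypothetical input vector
probs = forward(x, W1, b1, W2, b2)
group = int(np.argmax(probs)) + 1   # allocate to the group with the largest output
print(probs, group)
```

Training would adjust W1, b1, W2, and b2 so that the outputs match the known group memberships in a training set.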




Neural networks can be used for discrimination and classification. When they
are so used, the input variables are the measured group characteristics \(X_1,
X_2, \ldots, X_p\), and the output variables are categorical variables indicating group
membership. Current practical experience indicates that properly constructed
neural networks perform about as well as logistic regression and the discriminant
functions we have discussed in this chapter. Reference [30] contains a good discussion of
the use of neural networks in applied statistics.


Selection of Variables 


In some applications of discriminant analysis, data are available on a large number
of variables. (Mucciardi and Gose [27] discuss a discriminant analysis based on 157
variables.¹⁵) In this case, it would obviously be desirable to select a relatively small
subset of variables that would contain almost as much information as the original 
collection. This is the objective of stepwise discriminant analysis, and several popular 
commercial computer programs have such a capability. 

If a stepwise discriminant analysis (or any variable selection method) is
employed, the results should be interpreted with caution. (See [28].) There is no
guarantee that the subset selected is "best," regardless of the criterion used to make
the selection. For example, subsets selected on the basis of minimizing the apparent 
error rate or maximizing “discriminatory power” may perform poorly in future 
samples. Problems associated with variable-selection procedures are magnified if 
there are large correlations among the variables or between linear combinations of 
the variables. 

Choosing a subset of variables that seems to be optimal for a given data set is 
especially disturbing if classification is the objective. At the very least, the derived 
classification function should be evaluated with a validation sample. As Murray [28] 
suggests, a better idea might be to split the sample into a number of batches and 
determine the “best” subset for each batch. The number of times a given variable 
appears in the best subsets provides a measure of the worth of that variable for 
future classification. 


Testing for Group Differences 


We have pointed out, in connection with two group classification, that effective
allocation is probably not possible unless the populations are well separated. The same
is true for the many group situation. Classification is ordinarily not attempted
unless the population mean vectors differ significantly from one another. Assuming
that the data are nearly multivariate normal, with a common covariance matrix, 
MANOVA can be performed to test for differences in the population mean vectors. 
Although apparent significant differences do not automatically imply effective clas- 
sification, testing is a necessary first step. If no significant differences are found, con- 
structing classification rules will probably be a waste of time. 


15 Imagine the problems of verifying the assumption of 157-variate normality and simultaneously
estimating, for example, the 12,403 parameters of the 157 × 157 presumed common covariance matrix!




Graphics 


Sophisticated computer graphics now allow one visually to examine multivariate 
data in two and three dimensions. Thus, groupings in the variable space for any 
choice of two or three variables can often be discerned by eye. In this way, poten- 
tially important classifying variables are often identified and outlying, or “atypical,” 
observations revealed. Visual displays are important aids in discrimination and clas- 
sification, and their use is likely to increase as the hardware and associated comput- 
er programs become readily available. Frequently, as much can be learned from a 
visual examination as by a complex numerical analysis. 


Practical Considerations Regarding Multivariate Normality 


The interplay between the choice of tentative assumptions and the form of the re- 
sulting classifier is important. Consider Figure 11.21, which shows the kidney- 
shaped density contours from two very nonnormal densities. In this case, the normal 
theory linear (or even quadratic) classification rule will be inadequate compared to 
another choice. That is, linear discrimination here is inappropriate. 

Often discrimination is attempted with a large number of variables, some of 
which are of the presence—absence, or 0-1, type. In these situations and in others 
with restricted ranges for the variables, multivariate normality may not be a sensible 
assumption. As we have seen, classification based on Fisher’s linear discriminants 
can be optimal from a minimum ECM or minimum TPM point of view only when 
multivariate normality holds. How are we to interpret these quantities when nor- 
mality is clearly not viable? 

In the absence of multivariate normality, Fisher’s linear discriminants can be 
viewed as providing an approximation to the total sample information. The values 
of the first few discriminants themselves can be checked for normality and rule 
(11-67) employed. Since the discriminants are linear combinations of a large num- 
ber of variables, they will often be nearly normal. Of course, one must keep in mind 
that the first few discriminants are an incomplete summary of the original sample in- 
formation. Classification rules based on this restricted set may perform poorly, while 
optimal rules derived from all of the sample information may perform well. 


[Figure 11.21 Two nonnormal populations for which linear discrimination is
inappropriate. The plot, in the \(x_1\), \(x_2\) plane, shows a contour of \(f_1(\mathbf{x})\), a contour of
\(f_2(\mathbf{x})\), the regions \(R_1\) and \(R_2\), a "linear classification" boundary, and a
"good classification" boundary.]




EXERCISES 


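11.1. Consider the two data sets

\[
\mathbf{X}_1 = \begin{bmatrix} 3 & 7 \\ 2 & 4 \\ 4 & 7 \end{bmatrix}
\quad\text{and}\quad
\mathbf{X}_2 = \begin{bmatrix} 6 & 9 \\ 5 & 7 \\ 4 & 8 \end{bmatrix}
\]

for which

\[
\bar{\mathbf{x}}_1 = \begin{bmatrix} 3 \\ 6 \end{bmatrix}, \qquad
\bar{\mathbf{x}}_2 = \begin{bmatrix} 5 \\ 8 \end{bmatrix}
\]

and

\[
\mathbf{S}_{\text{pooled}} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}
\]

(a) Calculate the linear discriminant function in (11-19).
(b) Classify the observation \(\mathbf{x}_0' = [2 \;\; 7]\) as population \(\pi_1\) or population \(\pi_2\), using Rule
(11-18) with equal priors and equal costs.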
11.2. (a) Develop a linear classification function for the data in Example 11.1 using (11-19).

(b) Using the function in (a) and (11-20), construct the "confusion matrix" by classifying
the given observations. Compare your classification results with those of Figure 11.1,
where the classification regions were determined "by eye." (See Example 11.6.)

(c) Given the results in (b), calculate the apparent error rate (APER).

(d) State any assumptions you make to justify the use of the method in Parts a and b.


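11.3. Prove Result 11.1.
Hint: Substituting the integral expressions for \(P(2 \mid 1)\) and \(P(1 \mid 2)\) given by (11-1) and
(11-2), respectively, into (11-5) yields

\[
\text{ECM} = c(2 \mid 1)\, p_1 \int_{R_2} f_1(\mathbf{x})\, d\mathbf{x} + c(1 \mid 2)\, p_2 \int_{R_1} f_2(\mathbf{x})\, d\mathbf{x}
\]

Noting that \(\Omega = R_1 \cup R_2\), so that the total probability

\[
1 = \int_{\Omega} f_1(\mathbf{x})\, d\mathbf{x} = \int_{R_1} f_1(\mathbf{x})\, d\mathbf{x} + \int_{R_2} f_1(\mathbf{x})\, d\mathbf{x}
\]

we can write

\[
\text{ECM} = c(2 \mid 1)\, p_1 \left[ 1 - \int_{R_1} f_1(\mathbf{x})\, d\mathbf{x} \right] + c(1 \mid 2)\, p_2 \int_{R_1} f_2(\mathbf{x})\, d\mathbf{x}
\]

By the additive property of integrals (volumes),

\[
\text{ECM} = \int_{R_1} \bigl[ c(1 \mid 2)\, p_2 f_2(\mathbf{x}) - c(2 \mid 1)\, p_1 f_1(\mathbf{x}) \bigr]\, d\mathbf{x} + c(2 \mid 1)\, p_1
\]

Now, \(p_1\), \(p_2\), \(c(1 \mid 2)\), and \(c(2 \mid 1)\) are nonnegative. In addition, \(f_1(\mathbf{x})\) and \(f_2(\mathbf{x})\) are
nonnegative for all \(\mathbf{x}\) and are the only quantities in ECM that depend on \(\mathbf{x}\). Thus, ECM is
minimized if \(R_1\) includes those values \(\mathbf{x}\) for which the integrand

\[
\bigl[ c(1 \mid 2)\, p_2 f_2(\mathbf{x}) - c(2 \mid 1)\, p_1 f_1(\mathbf{x}) \bigr] \le 0
\]

and excludes those \(\mathbf{x}\) for which this quantity is positive.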


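11.4. A researcher wants to determine a procedure for discriminating between two
multivariate populations. The researcher has enough data available to estimate the density
functions \(f_1(\mathbf{x})\) and \(f_2(\mathbf{x})\) associated with populations \(\pi_1\) and \(\pi_2\), respectively. Let
\(c(2 \mid 1) = 50\) (this is the cost of assigning items as \(\pi_2\), given that \(\pi_1\) is true) and
\(c(1 \mid 2) = 100\).

In addition, it is known that about 20% of all possible items (for which the
measurements \(\mathbf{x}\) can be recorded) belong to \(\pi_2\).

(a) Give the minimum ECM rule (in general form) for assigning a new item to one of
the two populations.

(b) Measurements recorded on a new item yield the density values \(f_1(\mathbf{x}) = .3\) and
\(f_2(\mathbf{x}) = .5\). Given the preceding information, assign this item to population \(\pi_1\) or
population \(\pi_2\).


11.5. Show that

\[
-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_1)
+ \tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)
= (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{x}
- \tfrac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)
\]

[see Equation (11-13).]


11.6. Consider the linear function \(Y = \mathbf{a}'\mathbf{X}\). Let \(E(\mathbf{X}) = \boldsymbol{\mu}_1\) and \(\mathrm{Cov}(\mathbf{X}) = \boldsymbol{\Sigma}\) if \(\mathbf{X}\) belongs
to population \(\pi_1\). Let \(E(\mathbf{X}) = \boldsymbol{\mu}_2\) and \(\mathrm{Cov}(\mathbf{X}) = \boldsymbol{\Sigma}\) if \(\mathbf{X}\) belongs to population \(\pi_2\). Let
\(m = \tfrac{1}{2}(\mu_{1Y} + \mu_{2Y}) = \tfrac{1}{2}(\mathbf{a}'\boldsymbol{\mu}_1 + \mathbf{a}'\boldsymbol{\mu}_2)\). Given that \(\mathbf{a}' = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\), show each
of the following.

(a) \(E(\mathbf{a}'\mathbf{X} \mid \pi_1) - m = \mathbf{a}'\boldsymbol{\mu}_1 - m > 0\)

(b) \(E(\mathbf{a}'\mathbf{X} \mid \pi_2) - m = \mathbf{a}'\boldsymbol{\mu}_2 - m < 0\)

Hint: Recall that \(\boldsymbol{\Sigma}\) is of full rank and is positive definite, so \(\boldsymbol{\Sigma}^{-1}\) exists and is positive
definite.


11.7. Let \(f_1(x) = (1 - |x|)\) for \(|x| \le 1\) and \(f_2(x) = (1 - |x - .5|)\) for \(-.5 \le x \le 1.5\).

(a) Sketch the two densities.

(b) Identify the classification regions when \(p_1 = p_2\) and \(c(1 \mid 2) = c(2 \mid 1)\).

(c) Identify the classification regions when \(p_1 = .2\) and \(c(1 \mid 2) = c(2 \mid 1)\).


11.8. Refer to Exercise 11.7. Let \(f_1(x)\) be the same as in that exercise, but take
\(f_2(x) = \tfrac{1}{4}(2 - |x - .5|)\) for \(-1.5 \le x \le 2.5\).

(a) Sketch the two densities.

(b) Determine the classification regions when \(p_1 = p_2\) and \(c(1 \mid 2) = c(2 \mid 1)\).


11.9. For g = 2 groups, show that the ratio in (11-59) is proportional to the ratio

\[
\frac{\bigl(\text{squared distance between means of } Y\bigr)}{(\text{variance of } Y)}
= \frac{(\mu_{1Y} - \mu_{2Y})^2}{\sigma_Y^2}
= \frac{(\mathbf{a}'\boldsymbol{\mu}_1 - \mathbf{a}'\boldsymbol{\mu}_2)^2}{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}}
= \frac{\mathbf{a}'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\mathbf{a}}{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}}
= \frac{(\mathbf{a}'\boldsymbol{\delta})^2}{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}}
\]

where \(\boldsymbol{\delta} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\) is the difference in mean vectors. This ratio is the population
counterpart of (11-23). Show that the ratio is maximized by the linear combination

\[
\mathbf{a} = c\,\boldsymbol{\Sigma}^{-1}\boldsymbol{\delta} = c\,\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)
\]

for any \(c \ne 0\).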


Hint: Note that \((\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})(\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})' = \tfrac{1}{4}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\) for i = 1, 2, where
\(\bar{\boldsymbol{\mu}} = \tfrac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)\).


11.10. Suppose that \(n_1 = 11\) and \(n_2 = 12\) observations are made on two random variables \(X_1\)
and \(X_2\), where \(X_1\) and \(X_2\) are assumed to have a bivariate normal distribution with a
common covariance matrix \(\boldsymbol{\Sigma}\), but possibly different mean vectors \(\boldsymbol{\mu}_1\) and \(\boldsymbol{\mu}_2\) for the two
samples. The sample mean vectors and pooled covariance matrix are

\[
\bar{\mathbf{x}}_1 = \begin{bmatrix} -1 \\ -1 \end{bmatrix}, \qquad
\bar{\mathbf{x}}_2 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \qquad
\mathbf{S}_{\text{pooled}} = \begin{bmatrix} 7.3 & -1.1 \\ -1.1 & 4.8 \end{bmatrix}
\]

(a) Test for the difference in population mean vectors using Hotelling's two-sample
\(T^2\)-statistic. Let \(\alpha = .10\).

(b) Construct Fisher's (sample) linear discriminant function. [See (11-19) and (11-25).]

(c) Assign the observation \(\mathbf{x}_0' = [0 \;\; 1]\) to either population \(\pi_1\) or \(\pi_2\). Assume equal
costs and equal prior probabilities.


11.11. Suppose a univariate random variable X has a normal distribution with variance 4. If X
is from population \(\pi_1\), its mean is 10; if it is from population \(\pi_2\), its mean is 14. Assume
equal prior probabilities for the events A1 = X is from population \(\pi_1\) and A2 = X is
from population \(\pi_2\), and assume that the misclassification costs \(c(2 \mid 1)\) and \(c(1 \mid 2)\) are
equal (for instance, $10). We decide that we shall allocate (classify) X to population \(\pi_1\) if
\(X \le c\), for some c to be determined, and to population \(\pi_2\) if \(X > c\). Let B1 be the
event X is classified into population \(\pi_1\) and B2 be the event X is classified into
population \(\pi_2\). Make a table showing the following: P(B1|A2), P(B2|A1), P(A1 and B2),
P(A2 and B1), P(misclassification), and expected cost for various values of c. For what
choice of c is expected cost minimized? The table should take the following form:


c | P(B1|A2) | P(B2|A1) | P(A1 and B2) | P(A2 and B1) | P(error) | Expected cost


What is the value of the minimum expected cost?


11.12. Repeat Exercise 11.11 if the prior probabilities of A1 and A2 are equal, but
\(c(2 \mid 1)\) = $5 and \(c(1 \mid 2)\) = $15.


11.13. Repeat Exercise 11.11 if the prior probabilities of A1 and A2 are P(A1) = .25 and
P(A2) = .75 and the misclassification costs are as in Exercise 11.12.


11.14. Consider the discriminant functions derived in Example 11.3. Normalize \(\hat{\mathbf{a}}\) using (11-21)
and (11-22). Compute the two midpoints \(\hat{m}_1^*\) and \(\hat{m}_2^*\) corresponding to the two choices of
normalized vectors, say, \(\hat{\mathbf{a}}_1^*\) and \(\hat{\mathbf{a}}_2^*\). Classify \(\mathbf{x}_0' = [-.210, \; -.044]\) with the function
\(\hat{y}_0^* = \hat{\mathbf{a}}^{*\prime}\mathbf{x}_0\) for the two cases. Are the results consistent with the classification obtained
for the case of equal prior probabilities in Example 11.3? Should they be?


11.15. Derive the expressions in (11-27) from (11-6) when \(f_1(\mathbf{x})\) and \(f_2(\mathbf{x})\) are multivariate
normal densities with means \(\boldsymbol{\mu}_1\), \(\boldsymbol{\mu}_2\) and covariances \(\boldsymbol{\Sigma}_1\), \(\boldsymbol{\Sigma}_2\), respectively.


11.16. 


11.17. 


11.18. 


Exercises 653 


Suppose x comes from one of two populations: 
71: Normal with mean yz, and covariance matrix 2, 


72: Normal with mean s2 and covariance matrix Z 


If the respective density functions are denoted by f\(x) and f(x), find the expression 
for the quadratic discriminator 


If 2, = XZ, = &, for instance, verify that Q becomes 
(m1 — B2)'Z x — 5 (m1 — oe)’ E (my + M2) 


Suppose populations 7, and 72 are as follows: 


Population 
Ty 12 
Distribution Normal Normal 
Mean pt [10, 15]' [10,25]’ 
: 18 12 20 ~-7 
Covariance & Ee | ie ‘| 


Assume equal prior probabilities and misclassification costs of $c(2|1) = \$10$ and
$c(1|2) = \$73.89$. Find the posterior probabilities of populations $\pi_1$ and $\pi_2$, $P(\pi_1|x)$
and $P(\pi_2|x)$, the value of the quadratic discriminator $Q$ in Exercise 11.16, and the
classification for each value of $x$ in the following table:


x            $P(\pi_1|x)$    $P(\pi_2|x)$    $Q$    Classification

[10, 15]'
[12, 17]'
   ⋮
[30, 35]'

(Note: Use an increment of 2 in each coordinate, 11 points in all.)


Show each of the following on a graph of the $x_1$, $x_2$ plane.

(a) The mean of each population
(b) The ellipse of minimal area with probability .95 of containing $x$ for each population
(c) The region $R_1$ (for population $\pi_1$) and the region $\Omega - R_1 = R_2$ (for population $\pi_2$)
(d) The 11 points classified in the table
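One way to organize the table computations is sketched below. It assumes NumPy, two normal populations, and the usual minimum expected cost rule (allocate to $\pi_1$ when $Q \ge \ln[(c(1|2)/c(2|1))(p_2/p_1)]$). The mean vectors and covariance matrices in the sketch are hypothetical stand-ins, not the values of the exercise, and the function names are illustrative.

```python
import numpy as np

def log_density(x, mean, cov):
    """Log of the multivariate normal density at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(mean) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def classify(x, mu1, cov1, mu2, cov2, p1=0.5, p2=0.5, c21=10.0, c12=73.89):
    """Return (P(pi1|x), P(pi2|x), Q, label) under the minimum-ECM rule."""
    q = log_density(x, mu1, cov1) - log_density(x, mu2, cov2)
    post1 = 1.0 / (1.0 + (p2 / p1) * np.exp(-q))       # Bayes' rule
    threshold = np.log((c12 / c21) * (p2 / p1))        # close to 2 for these costs
    return post1, 1.0 - post1, q, (1 if q >= threshold else 2)

# Hypothetical parameters; these are not the matrices of the exercise.
mu1, cov1 = np.array([0.0, 0.0]), np.array([[4.0, 1.0], [1.0, 3.0]])
mu2, cov2 = np.array([2.0, 3.0]), np.array([[2.0, -0.5], [-0.5, 1.0]])

at_mu1 = classify(np.array([0.0, 0.0]), mu1, cov1, mu2, cov2)
at_mu2 = classify(np.array([2.0, 3.0]), mu1, cov1, mu2, cov2)
print(at_mu1)
print(at_mu2)
```

Because the cost ratio 73.89/10 is close to $e^2$, the cutoff on $Q$ is close to 2 rather than 0, so a point must favor $\pi_1$ fairly strongly before it is allocated there.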


If $B$ is defined as $c(\mu_1 - \mu_2)(\mu_1 - \mu_2)'$ for some constant $c$, verify that
$e = c\Sigma^{-1}(\mu_1 - \mu_2)$ is in fact an (unscaled) eigenvector of $\Sigma^{-1}B$, where $\Sigma$ is a covariance
matrix.

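A quick numerical check of the eigenvector claim can guide the algebra. This is a sketch only, assuming NumPy; the means, covariance matrix, and constant are made up for illustration. The candidate eigenvalue used below, $c(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$, is what direct multiplication suggests.

```python
import numpy as np

# Hypothetical values, for illustration only.
mu1 = np.array([2.0, 1.0, 0.0])
mu2 = np.array([0.5, 1.5, 1.0])
sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.5]])
c = 0.7

d = mu1 - mu2
B = c * np.outer(d, d)                     # B = c (mu1 - mu2)(mu1 - mu2)'
e = c * np.linalg.solve(sigma, d)          # e = c Sigma^{-1} (mu1 - mu2)
M = np.linalg.solve(sigma, B)              # Sigma^{-1} B

lam = c * d @ np.linalg.solve(sigma, d)    # candidate eigenvalue
print(np.allclose(M @ e, lam * e))
```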

11.19. 


(a) Using the original data sets $X_1$ and $X_2$ given in Example 11.7, calculate $\bar{x}_i$, $S_i$,
$i = 1, 2$, and $S_{\text{pooled}}$, verifying the results provided for these quantities in the
example.


654 Chapter 11 Discrimination and Classification 


11.20. 


11.21. 


(b) Using the calculations in Part a, compute Fisher's linear discriminant function, and
use it to classify the sample observations according to Rule (11-25). Verify that the
confusion matrix given in Example 11.7 is correct.


(c) Classify the sample observations on the basis of smallest squared distance $D_i^2(x)$ of
the observations from the group means $\bar{x}_1$ and $\bar{x}_2$. [See (11-54).] Compare the
results with those in Part b. Comment.


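The matrix identity (see Bartlett [3])

$$S_{H,\text{pooled}}^{-1} = \frac{n-3}{n-2}\left(S_{\text{pooled}}^{-1} + \frac{c_k}{1 - c_k(x_H - \bar{x}_k)'S_{\text{pooled}}^{-1}(x_H - \bar{x}_k)}\,S_{\text{pooled}}^{-1}(x_H - \bar{x}_k)(x_H - \bar{x}_k)'S_{\text{pooled}}^{-1}\right)$$

where

$$c_k = \frac{n_k}{(n_k - 1)(n - 2)}$$

allows the calculation of $S_{H,\text{pooled}}^{-1}$ from $S_{\text{pooled}}^{-1}$. Verify this identity using the data from
Example 11.7. Specifically, set $n = n_1 + n_2$, $k = 1$, and $x_H' = [2, 12]$. Calculate
$S_{H,\text{pooled}}^{-1}$ using the full data $S_{\text{pooled}}^{-1}$ and $\bar{x}_1$, and compare the result with $S_{H,\text{pooled}}^{-1}$ in
Example 11.7.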
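Before working with the Example 11.7 data, the identity can be tried on simulated data. The sketch below assumes NumPy and two groups; the samples are randomly generated (they are not the Example 11.7 observations), and `pooled_cov` is a helper name introduced here for illustration.

```python
import numpy as np

def pooled_cov(groups):
    """Pooled sample covariance with n - g degrees of freedom."""
    n = sum(len(g) for g in groups)
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    return W / (n - len(groups))

rng = np.random.default_rng(0)
X1 = rng.normal(size=(6, 2))            # made-up sample from group 1
X2 = rng.normal(size=(7, 2)) + 1.5      # made-up sample from group 2
n1, n2 = len(X1), len(X2)
n = n1 + n2

S_inv = np.linalg.inv(pooled_cov([X1, X2]))

h = 0                                    # hold out the first row of group k = 1
d = X1[h] - X1.mean(axis=0)              # x_H - xbar_k, using the full-sample mean
ck = n1 / ((n1 - 1) * (n - 2))

# Right-hand side of the identity, built from full-data quantities only.
update = (ck / (1 - ck * d @ S_inv @ d)) * (S_inv @ np.outer(d, d) @ S_inv)
SH_inv_identity = (n - 3) / (n - 2) * (S_inv + update)

# Direct route: delete the observation and recompute the pooled covariance.
X1_H = np.delete(X1, h, axis=0)
SH_inv_direct = np.linalg.inv(pooled_cov([X1_H, X2]))

print(np.allclose(SH_inv_identity, SH_inv_direct))
```

The practical appeal of the identity is that the holdout procedure then needs only one matrix inversion for the full data rather than one per deleted observation.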


Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_s > 0$ denote the $s \le \min(g - 1, p)$ nonzero eigenvalues of
$\Sigma^{-1}B_\mu$, and $e_1, e_2, \ldots, e_s$ the corresponding eigenvectors (scaled so that $e'\Sigma e = 1$).
Show that the vector of coefficients $a$ that maximizes the ratio

$$\frac{a'B_\mu a}{a'\Sigma a} = \frac{a'\left[\sum_{i=1}^{g}(\mu_i - \bar{\mu})(\mu_i - \bar{\mu})'\right]a}{a'\Sigma a}$$

is given by $a_1 = e_1$. The linear combination $a_1'X$ is called the first discriminant. Show
that the value $a_2 = e_2$ maximizes the ratio subject to $\mathrm{Cov}(a_1'X, a_2'X) = 0$. The linear
combination $a_2'X$ is called the second discriminant. Continuing, $a_k = e_k$ maximizes the
ratio subject to $0 = \mathrm{Cov}(a_k'X, a_i'X)$, $i < k$, and $a_k'X$ is called the $k$th discriminant.
Also, $\mathrm{Var}(a_i'X) = 1$, $i = 1, \ldots, s$. [See (11-62) for the sample equivalent.]

Hint: We first convert the maximization problem to one already solved. By the spectral
decomposition in (2-20), $\Sigma = P'\Lambda P$ where $\Lambda$ is a diagonal matrix with positive
elements $\lambda_i$. Let $\Lambda^{1/2}$ denote the diagonal matrix with elements $\sqrt{\lambda_i}$. By (2-22), the
symmetric square-root matrix $\Sigma^{1/2} = P'\Lambda^{1/2}P$ and its inverse $\Sigma^{-1/2} = P'\Lambda^{-1/2}P$ satisfy
$\Sigma^{1/2}\Sigma^{1/2} = \Sigma$, $\Sigma^{1/2}\Sigma^{-1/2} = I = \Sigma^{-1/2}\Sigma^{1/2}$, and $\Sigma^{-1/2}\Sigma^{-1/2} = \Sigma^{-1}$. Next, set

$$u = \Sigma^{1/2}a$$

so $u'u = a'\Sigma^{1/2}\Sigma^{1/2}a = a'\Sigma a$ and $u'\Sigma^{-1/2}B_\mu\Sigma^{-1/2}u = a'\Sigma^{1/2}\Sigma^{-1/2}B_\mu\Sigma^{-1/2}\Sigma^{1/2}a = a'B_\mu a$.
Consequently, the problem reduces to maximizing

$$\frac{u'\Sigma^{-1/2}B_\mu\Sigma^{-1/2}u}{u'u}$$

over $u$. From (2-51), the maximum of this ratio is $\lambda_1$, the largest eigenvalue of
$\Sigma^{-1/2}B_\mu\Sigma^{-1/2}$. This maximum occurs when $u = e_1$, the normalized eigenvector


11.22. 




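associated with $\lambda_1$. Because $e_1 = u = \Sigma^{1/2}a_1$, or $a_1 = \Sigma^{-1/2}e_1$,
$\mathrm{Var}(a_1'X) = a_1'\Sigma a_1 = e_1'\Sigma^{-1/2}\Sigma\Sigma^{-1/2}e_1 = e_1'\Sigma^{-1/2}\Sigma^{1/2}\Sigma^{1/2}\Sigma^{-1/2}e_1 = e_1'e_1 = 1$.
By (2-52), $u \perp e_1$ maximizes the
preceding ratio when $u = e_2$, the normalized eigenvector corresponding to $\lambda_2$. For this
choice, $a_2 = \Sigma^{-1/2}e_2$, and $\mathrm{Cov}(a_2'X, a_1'X) = a_2'\Sigma a_1 = e_2'\Sigma^{-1/2}\Sigma\Sigma^{-1/2}e_1 = e_2'e_1 = 0$,
since $e_2 \perp e_1$. Similarly, $\mathrm{Var}(a_2'X) = a_2'\Sigma a_2 = e_2'e_2 = 1$. Continue in this fashion for
the remaining discriminants. Note that if $\lambda$ and $e$ are an eigenvalue-eigenvector pair
of $\Sigma^{-1/2}B_\mu\Sigma^{-1/2}$, then

$$\Sigma^{-1/2}B_\mu\Sigma^{-1/2}e = \lambda e$$

and multiplication on the left by $\Sigma^{-1/2}$ gives

$$\Sigma^{-1/2}\Sigma^{-1/2}B_\mu\Sigma^{-1/2}e = \lambda\Sigma^{-1/2}e \quad\text{or}\quad \Sigma^{-1}B_\mu(\Sigma^{-1/2}e) = \lambda(\Sigma^{-1/2}e)$$

Thus, $\Sigma^{-1}B_\mu$ has the same eigenvalues as $\Sigma^{-1/2}B_\mu\Sigma^{-1/2}$, but the corresponding eigenvector
is proportional to $\Sigma^{-1/2}e = a$, as asserted.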
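The construction in the hint translates almost line for line into a numerical routine. The sketch below assumes NumPy; the covariance matrix and group means are hypothetical, and NumPy returns eigenvectors as columns, whereas the hint stores them as rows of $P$.

```python
import numpy as np

# Hypothetical population quantities, for illustration only.
sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
mus = np.array([[0.0, 0.0, 0.0],
                [1.0, 2.0, 0.5],
                [2.0, 0.5, 1.0]])          # g = 3 group means, one per row
dev = mus - mus.mean(axis=0)
B_mu = dev.T @ dev                          # sum of (mu_i - mubar)(mu_i - mubar)'

# Symmetric inverse square root from the spectral decomposition of sigma.
vals, vecs = np.linalg.eigh(sigma)
sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T

# Eigen-pairs of the symmetric matrix Sigma^{-1/2} B_mu Sigma^{-1/2}.
lam, E = np.linalg.eigh(sigma_inv_half @ B_mu @ sigma_inv_half)
order = np.argsort(lam)[::-1]               # largest eigenvalue first
lam, E = lam[order], E[:, order]

A = sigma_inv_half @ E                      # columns a_k = Sigma^{-1/2} e_k

print(np.round(lam, 6))
print(np.allclose(A.T @ sigma @ A, np.eye(3)))                   # unit variances, zero covariances
print(np.allclose(np.linalg.solve(sigma, B_mu) @ A, A * lam))    # eigenvectors of Sigma^{-1} B_mu
```

Working with the symmetric matrix keeps the eigenvectors orthonormal, which is what delivers $a_i'\Sigma a_k = 0$ for $i \ne k$ without a separate orthogonalization step. With three groups, only two eigenvalues are nonzero.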


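Show that $\Delta_S^2 = \lambda_1 + \lambda_2 + \cdots + \lambda_p = \lambda_1 + \lambda_2 + \cdots + \lambda_s$, where $\lambda_1, \lambda_2, \ldots, \lambda_s$ are the
nonzero eigenvalues of $\Sigma^{-1}B_\mu$ (or $\Sigma^{-1/2}B_\mu\Sigma^{-1/2}$) and $\Delta_S^2$ is given by (11-68). Also, show
that $\lambda_1 + \lambda_2 + \cdots + \lambda_r$ is the resulting separation when only the first $r$ discriminants,
$Y_1, Y_2, \ldots, Y_r$ are used.


Hint: Let $P$ be the orthogonal matrix whose $i$th row $e_i'$ is the eigenvector of $\Sigma^{-1/2}B_\mu\Sigma^{-1/2}$
corresponding to the $i$th largest eigenvalue, $i = 1, 2, \ldots, p$. Consider

$$\underset{(p\times 1)}{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_s \\ \vdots \\ Y_p \end{bmatrix} = \begin{bmatrix} e_1'\Sigma^{-1/2}X \\ \vdots \\ e_s'\Sigma^{-1/2}X \\ \vdots \\ e_p'\Sigma^{-1/2}X \end{bmatrix} = P\Sigma^{-1/2}X$$

Now, $\mu_{iY} = E(Y \mid \pi_i) = P\Sigma^{-1/2}\mu_i$ and $\bar{\mu}_Y = P\Sigma^{-1/2}\bar{\mu}$, so

$$(\mu_{iY} - \bar{\mu}_Y)'(\mu_{iY} - \bar{\mu}_Y) = (\mu_i - \bar{\mu})'\Sigma^{-1/2}P'P\Sigma^{-1/2}(\mu_i - \bar{\mu}) = (\mu_i - \bar{\mu})'\Sigma^{-1}(\mu_i - \bar{\mu})$$

Therefore, $\Delta_S^2 = \sum_{i=1}^{g}(\mu_{iY} - \bar{\mu}_Y)'(\mu_{iY} - \bar{\mu}_Y)$. Using $Y_1$, we have

$$\sum_{i=1}^{g}(\mu_{iY_1} - \bar{\mu}_{Y_1})^2 = \sum_{i=1}^{g} e_1'\Sigma^{-1/2}(\mu_i - \bar{\mu})(\mu_i - \bar{\mu})'\Sigma^{-1/2}e_1 = e_1'\Sigma^{-1/2}B_\mu\Sigma^{-1/2}e_1 = \lambda_1$$

because $e_1$ has eigenvalue $\lambda_1$. Similarly, $Y_2$ produces

$$\sum_{i=1}^{g}(\mu_{iY_2} - \bar{\mu}_{Y_2})^2 = e_2'\Sigma^{-1/2}B_\mu\Sigma^{-1/2}e_2 = \lambda_2$$

and $Y_p$ produces

$$\sum_{i=1}^{g}(\mu_{iY_p} - \bar{\mu}_{Y_p})^2 = e_p'\Sigma^{-1/2}B_\mu\Sigma^{-1/2}e_p = \lambda_p$$

Thus,

$$\begin{aligned}
\Delta_S^2 &= \sum_{i=1}^{g}(\mu_{iY} - \bar{\mu}_Y)'(\mu_{iY} - \bar{\mu}_Y) \\
&= \sum_{i=1}^{g}(\mu_{iY_1} - \bar{\mu}_{Y_1})^2 + \sum_{i=1}^{g}(\mu_{iY_2} - \bar{\mu}_{Y_2})^2 + \cdots + \sum_{i=1}^{g}(\mu_{iY_p} - \bar{\mu}_{Y_p})^2 \\
&= \lambda_1 + \lambda_2 + \cdots + \lambda_p = \lambda_1 + \lambda_2 + \cdots + \lambda_s
\end{aligned}$$

since $\lambda_{s+1} = \cdots = \lambda_p = 0$. If only the first $r$ discriminants are used, their contribution to
$\Delta_S^2$ is $\lambda_1 + \lambda_2 + \cdots + \lambda_r$.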
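The decomposition of the separation into eigenvalues can be confirmed numerically. The sketch below assumes NumPy and uses a made-up covariance matrix and made-up group means; it takes the separation to be $\sum_i (\mu_i - \bar{\mu})'\Sigma^{-1}(\mu_i - \bar{\mu})$, the form obtained in the hint.

```python
import numpy as np

# Hypothetical population quantities, for illustration only.
sigma = np.array([[1.5, 0.4], [0.4, 1.0]])
mus = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0], [2.0, -1.0]])   # g = 4 group means
dev = mus - mus.mean(axis=0)
B_mu = dev.T @ dev

# Separation: sum over groups of (mu_i - mubar)' Sigma^{-1} (mu_i - mubar).
separation = sum(d @ np.linalg.solve(sigma, d) for d in dev)

# Eigenvalues of Sigma^{-1} B_mu (real and nonnegative in theory).
lam = np.sort(np.linalg.eigvals(np.linalg.solve(sigma, B_mu)).real)[::-1]

print(separation, lam.sum())
print(lam[0] / lam.sum())     # share of the separation carried by the first discriminant
```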


The following exercises require the use of a computer. 


11.23. Consider the data given in Exercise 1.14.

(a) Check the marginal distributions of the $x_i$'s in both the multiple-sclerosis (MS)
group and non-multiple-sclerosis (NMS) group for normality by graphing the
corresponding observations as normal probability plots. Suggest appropriate data
transformations if the normality assumption is suspect.

(b) Assume that $\Sigma_1 = \Sigma_2 = \Sigma$. Construct Fisher's linear discriminant function. Do all
the variables in the discriminant function appear to be important? Discuss your
answer. Develop a classification rule assuming equal prior probabilities and equal
costs of misclassification.

(c) Using the results in (b), calculate the apparent error rate. If computing resources
allow, calculate an estimate of the expected actual error rate using Lachenbruch's
holdout procedure. Compare the two error rates.

11.24. Annual financial data are collected for bankrupt firms approximately 2 years prior to their
bankruptcy and for financially sound firms at about the same time. The data on four variables,
$X_1$ = CF/TD = (cash flow)/(total debt), $X_2$ = NI/TA = (net income)/(total assets),
$X_3$ = CA/CL = (current assets)/(current liabilities), and $X_4$ = CA/NS = (current
assets)/(net sales), are given in Table 11.4.

(a) Using a different symbol for each group, plot the data for the pairs of observations
$(x_1, x_2)$, $(x_1, x_3)$, and $(x_1, x_4)$. Does it appear as if the data are approximately
bivariate normal for any of these pairs of variables?

(b) Using the $n_1 = 21$ pairs of observations $(x_1, x_2)$ for bankrupt firms and the $n_2 = 25$
pairs of observations $(x_1, x_2)$ for nonbankrupt firms, calculate the sample mean vectors
$\bar{x}_1$ and $\bar{x}_2$ and the sample covariance matrices $S_1$ and $S_2$.

(c) Using the results in (b) and assuming that both random samples are from bivariate
normal populations, construct the classification rule (11-29) with $p_1 = p_2$ and
$c(1|2) = c(2|1)$.

(d) Evaluate the performance of the classification rule developed in (c) by computing
the apparent error rate (APER) from (11-34) and the estimated expected actual
error rate E(AER) from (11-36).

(e) Repeat Parts c and d, assuming that $p_1 = .05$, $p_2 = .95$, and $c(1|2) = c(2|1)$. Is
this choice of prior probabilities reasonable? Explain.

(f) Using the results in (b), form the pooled covariance matrix $S_{\text{pooled}}$, and construct
Fisher's sample linear discriminant function in (11-19). Use this function to classify
the sample observations and evaluate the APER. Is Fisher's linear discriminant
function a sensible choice for a classifier in this case? Explain.

(g) Repeat Parts b-e using the observation pairs $(x_1, x_3)$ and $(x_1, x_4)$. Do some variables
appear to be better classifiers than others? Explain.

(h) Repeat Parts b-e using observations on all four variables $(X_1, X_2, X_3, X_4)$.


Table 11.4 Bankruptcy Data


Row   x1 = CF/TD   x2 = NI/TA   x3 = CA/CL   x4 = CA/NS   Population πi, i = 1, 2
  1      -.45         -.41         1.09          .45          0
  2      -.56         -.31         1.51          .16          0
  3       .06          .02         1.01          .40          0
  4      -.07         -.09         1.45          .26          0
  5      -.10         -.09         1.56          .67          0
  6      -.14         -.07          .71          .28          0
  7       .04          .01         1.50          .71          0
  8      -.06         -.06         1.37          .40          0
  9       .07         -.01         1.37          .34          0
 10      -.13         -.14         1.42          .44          0
 11      -.23         -.30          .33          .18          0
 12       .07          .02         1.31          .25          0
 13       .01          .00         2.15          .70          0
 14      -.28         -.23         1.19          .66          0
 15       .15          .05         1.88          .27          0
 16       .37          .11         1.99          .38          0
 17      -.08         -.08         1.51          .42          0
 18       .05          .03         1.68          .95          0
 19       .01         -.00         1.26          .60          0
 20       .12          .11         1.14          .17          0
 21      -.28         -.27         1.27          .51          0
  1       .51          .10         2.49          .54          1
  2       .08          .02         2.01          .53          1
  3       .38          .11         3.27          .35          1
  4       .19          .05         2.25          .33          1
  5       .32          .07         4.24          .63          1
  6       .31          .05         4.45          .69          1
  7       .12          .05         2.52          .69          1
  8      -.02          .02         2.05          .35          1
  9       .22          .08         2.35          .40          1
 10       .17          .07         1.80          .52          1
 11       .15          .05         2.17          .55          1
 12      -.10         -.01         2.50          .58          1
 13       .14         -.03          .46          .26          1
 14       .14          .07         2.61          .52          1
 15       .15          .06         2.23          .56          1
 16       .16          .05         2.31          .20          1
 17       .29          .06         1.84          .38          1
 18       .54          .11         2.33          .48          1
 19      -.33         -.09         3.01          .47          1
 20       .48          .09         1.24          .18          1
 21       .56          .11         4.29          .45          1
 22       .20          .08         1.99          .30          1
 23       .47          .14         2.92          .45          1
 24       .17          .04         2.45          .14          1
 25       .58          .04         5.06          .13          1


Legend: 7; = 0: bankrupt firms; 72 = 1: nonbankrupt firms. 
Source: 1968, 1969, 1970, 1971, 1972 Moody’s Industrial Manuals. 




11.25. The annual financial data listed in Table 11.4 have been analyzed by Johnson [19] with a
view toward detecting influential observations in a discriminant analysis. Consider variables
$X_1$ = CF/TD and $X_3$ = CA/CL.


(a) Using the data on variables $X_1$ and $X_3$, construct Fisher's linear discriminant function.
Use this function to classify the sample observations and evaluate the APER.
[See (11-25) and (11-34).] Plot the data and the discriminant line in the $(x_1, x_3)$
coordinate system.


(b) Johnson [19] has argued that the multivariate observations in rows 16 for bankrupt
firms and 13 for sound firms are influential. Using the $X_1$, $X_3$ data, calculate Fisher's
linear discriminant function with only data point 16 for bankrupt firms deleted. Repeat
this procedure with only data point 13 for sound firms deleted. Plot the respective
discriminant lines on the scatter in part a, and calculate the APERs, ignoring the
deleted point in each case. Does deleting either of these multivariate observations
make a difference? (Note that neither of the potentially influential data points is
particularly “distant” from the center of its respective scatter.)


11.26. Using the data in Table 11.4, define a binary response variable $Z$ that assumes the value
0 if a firm is bankrupt and 1 if a firm is not bankrupt. Let $X$ = CA/CL, and consider the
straight-line regression of $Z$ on $X$.

(a) Although a binary response variable does not meet the standard regression assumptions,
consider using least squares to determine the fitted straight line for the $X$, $Z$
data. Plot the fitted values for bankrupt firms as a dot diagram on the interval [0, 1].
Repeat this procedure for nonbankrupt firms and overlay the two dot diagrams. A
reasonable discrimination rule is to predict that a firm will go bankrupt if its fitted
value is closer to 0 than to 1, that is, if the fitted value is less than .5. Similarly, a firm is
predicted to be sound if its fitted value is greater than .5. Use this decision rule to
classify the sample firms. Calculate the APER.

(b) Repeat the analysis in Part a using all four variables, $X_1, \ldots, X_4$. Is there any change
in the APER? Do data points 16 for bankrupt firms and 13 for nonbankrupt firms
stand out as influential?

(c) Perform a logistic regression using all four variables.
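The decision rule described in Part a of Exercise 11.26 (fit a straight line to the 0/1 response and classify by whether the fitted value exceeds .5) might be sketched as follows. The code uses only the Python standard library; the twelve data values are made up to stand in for CA/CL and are not taken from Table 11.4, and `fit_line` is an illustrative helper name.

```python
def fit_line(x, z):
    """Ordinary least squares for z = b0 + b1 * x."""
    n = len(x)
    xbar, zbar = sum(x) / n, sum(z) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
    b1 = sxz / sxx
    return zbar - b1 * xbar, b1

# Made-up values standing in for X = CA/CL; z = 0 (bankrupt), 1 (sound).
x = [1.1, 1.5, 1.0, 1.4, 0.7, 2.2, 2.5, 2.0, 3.3, 1.2, 4.2, 2.3]
z = [0,   0,   0,   0,   0,   0,   1,   1,   1,   1,   1,   1]

b0, b1 = fit_line(x, z)
fitted = [b0 + b1 * xi for xi in x]

# Predict "sound" (1) when the fitted value exceeds .5, "bankrupt" (0) otherwise.
predicted = [1 if f > 0.5 else 0 for f in fitted]
aper = sum(p != zi for p, zi in zip(predicted, z)) / len(z)
print(round(b0, 3), round(b1, 3), aper)
```

Since the least squares line passes through the point of means, equal group sizes (as in this toy sample) place the .5 cutoff exactly at the overall mean of $X$; with unequal group sizes the cutoff shifts.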


11.27. The data in Table 11.5 contain observations on $X_2$ = sepal width and $X_4$ = petal width
for samples from three species of iris. There are $n_1 = n_2 = n_3 = 50$ observations in each
sample.

(a) Plot the data in the $(x_2, x_4)$ variable space. Do the observations for the three groups
appear to be bivariate normal?


Table 11.5 Data on Irises


$\pi_1$: Iris setosa      $\pi_2$: Iris versicolor      $\pi_3$: Iris virginica

Columns for each species: Sepal length $x_1$, Sepal width $x_2$, Petal length $x_3$, Petal width $x_4$

Source: Anderson [1]. 


X2 X3 X% xy 
3.4 14 0.3 6.3 
3.4 1.5 0.2 4.9 
2.9 14 0.2 6.6 

; 1.5 0.1 5.2 
3.7 15 0.2 5.0 
3.4 1.6 0.2 5.9 : 
3.0 1.4 0.1 6.0 2.2 
3.0 1.1 0.1 6.1 2.9 
4.0 1.2 0.2 5.6 2.9 
4.4 1.5 0.4 6.7 3.1 
3.9 13 0.4 5.6 3.0 
3.5 1.4 0.3 5.8 2.7 
3.8 17 0.3 6.2 2.2 
3.8 1.5 0.3 5.6 2.5 
3.4 1.7 0.2 5.9 3.2 
3.7 1.5 0.4 6.1 2.8 
3.6 1.0 0.2 6.3 2.5 
3.3 1.7 0.5 6.1 2.8 
3.4 1.9 0.2 6.4 2.9 
3.0 1.6 0.2 6.6 3.0 
34 1.6 0.4 6.8 2.8 
3.5 1.5 0.2 6.7 3.0 
34 14 0.2 6.0 2.9 
3.2 1.6 0.2 5.7 2.6 
3.1 16 0.2 5.5 2.4 
3.4 1.5 0.4 5.5 2.4 
4.1 1.5 0.1 5.8 2.7 
4.2 1.4 0.2 6.0 2.7 




(b) Assume that the samples are from bivariate normal populations with a common
covariance matrix. Test the hypothesis $H_0$: $\mu_1 = \mu_2 = \mu_3$ versus $H_1$: at least one $\mu_i$
is different from the others at the $\alpha = .05$ significance level. Is the assumption of a
common covariance matrix reasonable in this case? Explain.

(c) Assuming that the populations are bivariate normal, construct the quadratic
discriminant scores $\hat{d}_i^Q(x)$ given by (11-47) with $p_1 = p_2 = p_3 = \tfrac{1}{3}$. Using Rule
(11-48), classify the new observation $x_0' = [3.5 \;\; 1.75]$ into population $\pi_1$, $\pi_2$, or
$\pi_3$.

(d) Assume that the covariance matrices $\Sigma_i$ are the same for all three bivariate normal
populations. Construct the linear discriminant score $\hat{d}_i(x)$ given by (11-51), and use
it to assign $x_0' = [3.5 \;\; 1.75]$ to one of the populations $\pi_i$, $i = 1, 2, 3$ according to
(11-52). Take $p_1 = p_2 = p_3 = \tfrac{1}{3}$. Compare the results in Parts c and d. Which
approach do you prefer? Explain.

(e) Assuming equal covariance matrices and bivariate normal populations, and supposing
that $p_1 = p_2 = p_3 = \tfrac{1}{3}$, allocate $x_0' = [3.5 \;\; 1.75]$ to $\pi_1$, $\pi_2$, or $\pi_3$ using Rule
(11-56). Compare the result with that in Part d. Delineate the classification regions
$\hat{R}_1$, $\hat{R}_2$, and $\hat{R}_3$ on your graph from Part a determined by the linear functions
$\hat{d}_{ki}(x_0)$ in (11-56).

(f) Using the linear discriminant scores from Part d, classify the sample observations.
Calculate the APER and E(AER). (To calculate the latter, you should use Lachenbruch's
holdout procedure. [See (11-57).])


11.28. Darroch and Mosimann [6] have argued that the three species of iris indicated in
Table 11.5 can be discriminated on the basis of “shape” or scale-free information alone.
Let $Y_1 = X_1/X_2$ be sepal shape and $Y_2 = X_3/X_4$ be petal shape.

(a) Plot the data in the $(\log Y_1, \log Y_2)$ variable space. Do the observations for the three
groups appear to be bivariate normal?

(b) Assuming equal covariance matrices and bivariate normal populations, and
supposing that $p_1 = p_2 = p_3 = \tfrac{1}{3}$, construct the linear discriminant scores $\hat{d}_i(x)$
given by (11-51) using both variables $\log Y_1$, $\log Y_2$ and each variable individually.
Calculate the APERs.

(c) Using the linear discriminant functions from Part b, calculate the holdout estimates
of the expected AERs, and fill in the following summary table:


Variable(s)               Misclassification rate

$\log Y_1$
$\log Y_2$
$\log Y_1$, $\log Y_2$


Compare the preceding misclassification rates with those in the summary tables in
Example 11.12. Does it appear as if information on shape alone is an effective
discriminator for these species of iris?

(d) Compare the corresponding error rates in Parts b and c. Given the scatter plot in
Part a, would you expect these rates to differ much? Explain.


11.29. The GPA and GMAT data alluded to in Example 11.11 are listed in Table 11.6.


(a) Using these data, calculate $\bar{x}_1$, $\bar{x}_2$, $\bar{x}_3$, $\bar{x}$, and $S_{\text{pooled}}$ and thus verify the results for
these quantities given in Example 11.11.


Table 11.6 Admission Data for Graduate School of Business


$\pi_1$: Admit      $\pi_2$: Do not admit      $\pi_3$: Borderline

Columns for each group: Applicant no., GPA ($x_1$), GMAT ($x_2$)


(b) Calculate $W^{-1}$ and $B$ and the eigenvalues and eigenvectors of $W^{-1}B$. Use the linear
discriminants derived from these eigenvectors to classify the new observation
$x_0' = [3.21 \;\; 497]$ into one of the populations $\pi_1$: admit; $\pi_2$: not admit; and $\pi_3$: borderline.
Does the classification agree with that in Example 11.11? Should it? Explain.


11.30. Gerrild and Lantz [13] chemically analyzed crude-oil samples from three zones of sandstone:

$\pi_1$: Wilhelm
$\pi_2$: Sub-Mulinia
$\pi_3$: Upper

The values of the trace elements

$X_1$ = vanadium (in percent ash)
$X_2$ = iron (in percent ash)
$X_3$ = beryllium (in percent ash)

and two measures of hydrocarbons,

$X_4$ = saturated hydrocarbons (in percent area)
$X_5$ = aromatic hydrocarbons (in percent area)

are presented for 56 cases in Table 11.7. The last two measurements are determined from
areas under a gas-liquid chromatography curve.


adequacy of the assumption of normality.


(b) Determine the estimate of E(AER) using Lachenbruch's holdout procedure, and
give the confusion matrix.


(c) Consider various transformations of the data to normality (see Example 11.14), and
repeat Parts a and b.


Table 11.7 Crude-Oil Data

  x1     x2     x3     x4     x5

 10.0   18.0   0.10   3.06    7.67
  7.3   15.0   0.05   3.76    6.84
  9.5   22.0   0.30   3.98    5.02
  8.4   15.0   0.20   5.02   10.12
  8.4   17.0   0.20   4.42    8.25
  9.5   25.0   0.50   4.44    5.95
  7.2   22.0   1.00   4.70    3.49
  4.0   12.0   0.50   5.71    6.32
  6.7   52.0   0.50   4.80    3.20
  9.0   27.0   0.30   3.69    3.30
  7.8   29.0   1.50   6.72    5.75
  4.5   41.0   0.50   3.33    2.27
  6.2   34.0   0.70   7.56    6.93
  5.6   20.0   0.50   5.07    6.70
  9.0   17.0   0.20   4.39    8.33
  8.4   20.0   0.10   3.74    3.77
  9.5   19.0   0.50   3.72    7.37
  9.0   20.0   0.50   5.97   11.17
  6.2   16.0   0.05   4.23    4.18
  7.3   20.0   0.50   4.39    3.50
  3.6   15.0   0.70   7.00    4.82
  6.2   34.0   0.07   4.84    2.37
  7.3   22.0   0.00   4.13    2.70
  4.1   29.0   0.70   5.78    7.76
  5.4   29.0   0.20   4.64    2.65
  5.0   34.0   0.70   4.21    6.50
  6.2   27.0   0.30   3.97    2.97


11.31. Refer to the data on salmon in Table 11.2.

(a) Plot the bivariate data for the two groups of salmon. Are the sizes and orientation of
the scatters roughly the same? Do bivariate normal distributions with a common
covariance matrix appear to be viable population models for the Alaskan and Canadian
salmon?


(b) Using a linear discriminant function for two normal populations with equal priors
and equal costs [see (11-19)], construct dot diagrams of the discriminant scores for
the two groups. Does it appear as if the growth ring diameters separate for the two
groups reasonably well? Explain.

(c) Repeat the analysis in Example 11.8 for the male and female salmon separately. Is it
easier to discriminate Alaskan male salmon from Canadian male salmon than it is to
discriminate the females in the two groups? Is gender (male or female) likely to be a
useful discriminatory variable?


11.32. Data on hemophilia A carriers, similar to those used in Example 11.3, are listed in
Table 11.8 on page 664. (See [15].) Using these data,


(a) Investigate the assumption of bivariate normality for the two groups.


Table 11.8 Hemophilia Data


         Noncarriers ($\pi_1$)                              Obligatory carriers ($\pi_2$)
Group   log10(AHF activity)   log10(AHF antigen)    Group   log10(AHF activity)   log10(AHF antigen)

  1         -.0056                -.1657               2
  1         -.1698                -.1585               2
  1         -.3469                -.1879               2
  1         -.0894                 .0064               2
  1         -.1679                 .0713               2
  1         -.0836                 .0106               2
  1         -.1979                -.0005               2
  1         -.0762                 .0392               2
  1         -.1913                -.2123               2
  1         -.1092                -.1190               2
  1         -.5268                -.4773               2
  1         -.0842                 .0248               2
  1         -.0225                -.0580               2
  1          .0084                 .0782               2
  1         -.1827                -.1138               2
  1          .1237                 .2140               2
  1         -.4702                -.3099               2
  1         -.1519                -.0686               2
  1          .0006                -.1153               2
  1         -.2015                -.0498               2
  1         -.1932                -.2293               2
  1          .1507                 .0933               2
  1         -.1259                -.0669               2
  1         -.1551                -.1232               2
  1         -.1952                -.1007               2
  1          .0291                 .0442               2
  1         -.2228                -.1710               2
  1         -.0997                -.0733               2
  1         -.1972                -.0607               2
  1         -.0867                -.0560               2
                                                       2
                                                       2
                                                       2
                                                       2
                                                       2
                                                       2
                                                       2
                                                       2
                                                       2
                                                       2
                                                       2
                                                       2
                                                       2
                                                       2
                                                       2


Source: See [15]. 


11.33. 


11.34.


11.35.


(b) Obtain the sample linear discriminant function, assuming equal prior probabilities,
and estimate the error rate using the holdout procedure.


(c) Classify the following 10 new cases using the discriminant function in Part b.


(d) Repeat Parts a-c, assuming that the prior probability of obligatory carriers (group 2)
is $\tfrac{1}{4}$ and that of noncarriers (group 1) is $\tfrac{3}{4}$.


New Cases Requiring Classification


Case   log10(AHF activity)   log10(AHF antigen)
  1         -.112                 -.279
  2         -.059                 -.068
  3          .064                  .012
  4         -.043                 -.052
  5         -.050                 -.098
  6         -.094                 -.113
  7         -.123                 -.143
  8         -.011                 -.037
  9         -.210                 -.090
 10         -.126                 -.019


Consider the data on bulls in Table 1.10.


(a) Using the variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt,
calculate Fisher's linear discriminants, and classify the bulls as Angus, Hereford,
or Simental. Calculate an estimate of E(AER) using the holdout procedure.
Classify a bull with characteristics YrHgt = 50, FtFrBody = 1000, PrctFFB = 73,
Frame = 7, BkFat = .17, SaleHt = 54, and SaleWt = 1525 as one of the three
breeds. Plot the discriminant scores for the bulls in the two-dimensional discriminant
space using different plotting symbols to identify the three groups.

(b) Is there a subset of the original seven variables that is almost as good for discriminating
among the three breeds? Explore this possibility by computing the estimated
E(AER) for various subsets.


Table 11.9 on pages 666-667 contains data on breakfast cereals produced by three
different American manufacturers: General Mills (G), Kellogg (K), and Quaker (Q).
Assuming multivariate normal data with a common covariance matrix, equal costs, and
equal priors, classify the cereal brands according to manufacturer. Compute the estimated
E(AER) using the holdout procedure. Interpret the coefficients of the discriminant
functions. Does it appear as if some manufacturers are associated with more “nutritional”
cereals (high protein, low fat, high fiber, low sugar, and so forth) than others? Plot the
cereals in the two-dimensional discriminant space, using different plotting symbols to
identify the three manufacturers.


Table 11.10 on page 668 contains measurements on the gender, age, tail length (mm), and
snout to vent length (mm) for Concho Water Snakes.


Define the variables

$X_1$ = Gender
$X_2$ = Age
$X_3$ = TailLength
$X_4$ = SntoVnLength


Table 11.9 Data on Brands of Cereal

Brand   Manufacturer   Calories   Protein   Fat   Sodium   Fiber   Carbohydrates   Sugar   Potassium   Group

Source: Data courtesy of Chad Dacus.


Table 11.10 Concho Water Snake Data


Gender  Age  TailLength  SntoVnLength       Gender  Age  TailLength  SntoVnLength

Female 2 127 441 1 Male 2 126 457 
Female 2 171 455 2 Male 2 128 466 
Female 2 171 462 3 Male 2 151 466 
Female 2 164 446 4 Male 2 115 361 
Female 2 165 463 5 Male 2 138 473 
Female 2 127 393 6 Male 2 145 477
Female 2 162 451 7 Male 3 145 507 
Female 2 8 Male 3 145 493 
Female 2 9 Male 3 158 558 
Female 2 Male 3 152 495 
Female 2 Male 3 159 521 
Female 3 Male 3 138 487 
Female 3 Male 3 166 565 
Female 3 Male 3 168 585 
Female 3 Male 3 160 550 
Female 3 Male 4 181 652 
Female 3 Male 4 185 587 
Female 3 Male 4 172 606 
Female 3 Male 4 180 591 
Female 3 Male 4 205 683 
Female 3 Male 4 175 625 
Female 3 Male 4 182 612 
Female 3 Male 4 185 618 
Female 3 Male 4 181 613 
Female 3 Male 4 167 600 
Female 3 Male 4 167 602 
Female 3 Male 4 160 596 
Female 3 Male 4 165 611 
Female 4 Male 4 173 603 
Female 4 

Female 4 

Female 4 

Female 4 

Female 4 

Female 4 

Female 4 

Female 4 


Source: Data courtesy of Raymond J. Carroll. 


(a) Plot the data as a scatter plot with tail length ($x_3$) as the horizontal axis and snout to
vent length ($x_4$) as the vertical axis. Use different plotting symbols for female and
male snakes, and different symbols for different ages. Does it appear as if tail length
and snout to vent length might usefully discriminate the genders of snakes? The
different ages of snakes?

(b) Assuming multivariate normal data with a common covariance matrix, equal priors,
and equal costs, classify the Concho Water Snakes according to gender. Compute the
estimated E(AER) using the holdout procedure.


11.36. 




(c) Repeat part (b) using age as the groups rather than gender. 


(d) Repeat part (b) using only snout to vent length to classify the snakes according to 
age. Compare the results with those in part (c). Can effective classification be 
achieved with only a single variable in this case? Explain. 


Refer to Example 11.17. Using logistic regression, refit the salmon data in Table 11.2
with only the covariates freshwater growth and marine growth. Check for the significance
of the model and the significance of each individual covariate. Set $\alpha = .05$. Use
the fitted function to classify each of the observations in Table 11.2 as Alaskan salmon or
Canadian salmon using rule (11-77). Compute the apparent error rate, APER, and compare
this error rate with the error rate from the linear classification function discussed in
Example 11.8.


References 


1. Anderson, E. “The Irises of the Gaspé Peninsula.” Bulletin of the American Iris Society,
59 (1939), 2-5.

2. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York:
John Wiley, 2003.

3. Bartlett, M. S. “An Inverse Matrix Adjustment Arising in Discriminant Analysis.” Annals
of Mathematical Statistics, 22 (1951), 107-111.

4. Bouma, B. N., et al. “Evaluation of the Detection Rate of Hemophilia Carriers.”
Statistical Methods for Clinical Decision Making, 77, no. 2 (1975), 339-350.

5. Breiman, L., J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Belmont, CA: Wadsworth, Inc., 1984.

6. Darroch, J. N., and J. E. Mosimann. “Canonical and Principal Components of Shape.”
Biometrika, 72, no. 1 (1985), 241-252.

7. Efron, B. “The Efficiency of Logistic Regression Compared to Normal Discriminant
Analysis.” Journal of the American Statistical Association, 81 (1975), 321-327.

8. Eisenbeis, R. A. “Pitfalls in the Application of Discriminant Analysis in Business,
Finance and Economics.” Journal of Finance, 32, no. 3 (1977), 875-900.

9. Fisher, R. A. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of
Eugenics, 7 (1936), 179-188.

10. Fisher, R. A. “The Statistical Utilization of Multiple Measurements.” Annals of Eugenics,
8 (1938), 376-386.

11. Ganesalingam, S. “Classification and Mixture Approaches to Clustering via Maximum
Likelihood.” Applied Statistics, 38, no. 3 (1989), 455-466.

12. Geisser, S. “Discrimination, Allocatory and Separatory, Linear Aspects.” In Classification
and Clustering, edited by J. Van Ryzin, pp. 301-330. New York: Academic Press, 1977.

13. Gerrild, P. M., and R. J. Lantz. “Chemical Analysis of 75 Crude Oil Samples from
Pliocene Sand Units, Elk Hills Oil Field, California.” U.S. Geological Survey Open-File
Report, 1969.

14. Gnanadesikan, R. Methods for Statistical Data Analysis of Multivariate Observations
(2nd ed.). New York: Wiley-Interscience, 1997.

15. Habbema, J. D. F., J. Hermans, and K. Van Den Broek. “A Stepwise Discriminant
Analysis Program Using Density Estimation.” In Compstat 1974, Proc. Computational
Statistics, pp. 101-110. Vienna: Physica, 1974.

16. Hills, M. “Allocation Rules and Their Error Rates.” Journal of the Royal Statistical
Society (B), 28 (1966), 1-31.

17. Hosmer, D. W., and S. Lemeshow. Applied Logistic Regression (2nd ed.). New York:
Wiley-Interscience, 2000.

18. Hudlet, R., and R. A. Johnson. “Linear Discrimination and Some Further Results on
Best Lower Dimensional Representations.” In Classification and Clustering, edited by
J. Van Ryzin, pp. 371-394. New York: Academic Press, 1977.

19. Johnson, W. “The Detection of Influential Observations for Allocation, Separation, and
the Determination of Probabilities in a Bayesian Framework.” Journal of Business and
Economic Statistics, 5, no. 3 (1987), 369-381.

20. Kendall, M. G. Multivariate Analysis. New York: Hafner Press, 1975.

21. Kim, H., and W. Y. Loh. “Classification Trees with Unbiased Multiway Splits.” Journal of
the American Statistical Association, 96 (2001), 589-604.

22. Krzanowski, W. J. “The Performance of Fisher's Linear Discriminant Function under
Non-Optimal Conditions.” Technometrics, 19, no. 2 (1977), 191-200.

23. Lachenbruch, P. A. Discriminant Analysis. New York: Hafner Press, 1975.

24. Lachenbruch, P. A., and M. R. Mickey. “Estimation of Error Rates in Discriminant
Analysis.” Technometrics, 10, no. 1 (1968), 1-11.

25. Loh, W. Y., and Y. S. Shih. “Split Selection Methods for Classification Trees.” Statistica
Sinica, 7 (1997), 815-840.

26. McCullagh, P., and J. A. Nelder. Generalized Linear Models (2nd ed.). London: Chapman
and Hall, 1989.

27. Mucciardi, A. N., and E. E. Gose. “A Comparison of Seven Techniques for Choosing
Subsets of Pattern Recognition Properties.” IEEE Trans. Computers, C20 (1971), 1023-1031.

28. Murray, G. D. “A Cautionary Note on Selection of Variables in Discriminant Analysis.”
Applied Statistics, 26, no. 3 (1977), 246-250.

29. Rencher, A. C. “Interpretation of Canonical Discriminant Functions, Canonical Variates
and Principal Components.” The American Statistician, 46 (1992), 217-225.

30. Stern, H. S. “Neural Networks in Applied Statistics.” Technometrics, 38 (1996), 205-214.

31. Wald, A. “On a Statistical Problem Arising in the Classification of an Individual into
One of Two Groups.” Annals of Mathematical Statistics, 15 (1944), 145-162.

32. Welch, B. L. “Note on Discriminant Functions.” Biometrika, 31 (1939), 218-220.


Chapter 


CLUSTERING, DISTANCE METHODS, 
AND ORDINATION 


12.1 Introduction 


Rudimentary, exploratory procedures are often quite helpful in understanding
the complex nature of multivariate relationships. For example, throughout
this book, we have emphasized the value of data plots. In this chapter, we shall
discuss some additional displays based on certain measures of distance and suggested
step-by-step rules (algorithms) for grouping objects (variables or items). Searching
the data for a structure of “natural” groupings is an important exploratory
technique. Groupings can provide an informal means for assessing dimensionality,
identifying outliers, and suggesting interesting hypotheses concerning relationships.

Grouping, or clustering, is distinct from the classification methods discussed in
the previous chapter. Classification pertains to a known number of groups, and the
operational objective is to assign new observations to one of these groups. Cluster
analysis is a more primitive technique in that no assumptions are made concerning
the number of groups or the group structure. Grouping is done on the basis of
similarities or distances (dissimilarities). The inputs required are similarity measures or
data from which similarities can be computed.

To illustrate the nature of the difficulty in defining a natural grouping, consider 
sorting the 16 face cards in an ordinary deck of playing cards into clusters of similar 
objects. Some groupings are illustrated in Figure 12.1. It is immediately clear that 
meaningful partitions depend on the definition of similar. 

In most practical applications of cluster analysis, the investigator knows enough 
about the problem to distinguish “good” groupings from “bad” groupings. Why not 
enumerate all possible groupings and select the “best” ones for further study? 




Figure 12.1 Grouping face cards: (a) individual cards; (b) individual suits; 
(c) black and red suits; (d) major and minor suits (bridge); (e) hearts plus queen 
of spades and other suits (hearts); (f) like face cards. 


For the playing-card example, there is one way to form a single group of 
16 face cards, there are 32,767 ways to partition the face cards into two groups (of 
varying sizes), there are 7,141,686 ways to sort the face cards into three groups 
(of varying sizes), and so on.¹ Obviously, time constraints make it impossible to 
determine the best groupings of similar objects from a list of all possible structures. 
Even fast computers are easily overwhelmed by the typically large number 
of cases, so one must settle for algorithms that search for good, but not necessarily 
the best, groupings. 

To summarize, the basic objective in cluster analysis is to discover natural 
groupings of the items (or variables). In turn, we must first develop a quantitative 
scale on which to measure the association (similarity) between objects. Section 12.2 
is devoted to a discussion of similarity measures. After that section, we describe a 
few of the more common algorithms for sorting objects into groups. 


¹The number of ways of sorting $n$ objects into $k$ nonempty groups is a Stirling number of the second 
kind given by $(1/k!) \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^{n}$. (See [1].) Adding these numbers for 
$k = 1, 2, \ldots, n$ groups, we obtain the total number of possible ways to sort $n$ objects into groups. 
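
The counts quoted for the playing-card example can be checked directly from the Stirling-number formula in the footnote. The following is a minimal sketch; the function names `stirling2` and `total_groupings` are illustrative choices, not names used in the text.

```python
from math import comb, factorial


def stirling2(n: int, k: int) -> int:
    """Ways to sort n objects into k nonempty groups (Stirling number, second kind)."""
    total = sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(k + 1))
    return total // factorial(k)  # the alternating sum is always divisible by k!


def total_groupings(n: int) -> int:
    """Total number of groupings of n objects: sum over k = 1, ..., n."""
    return sum(stirling2(n, k) for k in range(1, n + 1))


# The 16 face cards of the playing-card example.
print(stirling2(16, 1))  # 1
print(stirling2(16, 2))  # 32767
print(stirling2(16, 3))  # 7141686
print(total_groupings(16))
```

Even for only 16 objects the total is in the billions, which is the practical reason for settling on search algorithms rather than enumeration.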




Even without the precise notion of a natural grouping, we are often able to 
group objects in two- or three-dimensional plots by eye. Stars and Chernoff faces, 
discussed in Section 1.4, have been used for this purpose. (See Examples 1.11 and 
1.12.) Additional procedures for depicting high-dimensional observations in two di- 
mensions such that similar objects are, in some sense, close to one another are con- 
sidered in Sections 12.5-12.7. 


12.2 Similarity Measures 


Most efforts to produce a rather simple group structure from a complex data set re- 
quire a measure of “closeness,” or “similarity.” There is often a great deal of subjec- 
tivity involved in the choice of a similarity measure. Important considerations 
include the nature of the variables (discrete, continuous, binary), scales of measure- 
ment (nominal, ordinal, interval, ratio), and subject matter knowledge. 

When items (units or cases) are clustered, proximity is usually indicated by 
some sort of distance. By contrast, variables are usually grouped on the basis of 
correlation coefficients or like measures of association. 


Distances and Similarity Coefficients for Pairs of Items 


We discussed the notion of distance in Chapter 1, Section 1.5. Recall that the 
Euclidean (straight-line) distance between two p-dimensional observations (items) 
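$\mathbf{x}' = [x_1, x_2, \ldots, x_p]$ and $\mathbf{y}' = [y_1, y_2, \ldots, y_p]$ is, from (1-12), 

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2} = \sqrt{(\mathbf{x} - \mathbf{y})'(\mathbf{x} - \mathbf{y})} \tag{12-1}$$

The statistical distance between the same two observations is of the form [see (1-23)] 

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})'\mathbf{A}(\mathbf{x} - \mathbf{y})} \tag{12-2}$$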


Ordinarily, $\mathbf{A} = \mathbf{S}^{-1}$, where $\mathbf{S}$ contains the sample variances and covariances. 
However, without prior knowledge of the distinct groups, these sample quantities 
cannot be computed. For this reason, Euclidean distance is often preferred for 
clustering. 

Another distance measure is the Minkowski metric 


$$d(\mathbf{x}, \mathbf{y}) = \left[ \sum_{i=1}^{p} |x_i - y_i|^{m} \right]^{1/m} \tag{12-3}$$


For m = 1, d(x,y) measures the “city-block” distance between two points in p 
dimensions. For m = 2, d(x, y) becomes the Euclidean distance. In general, varying 
m changes the weight given to larger and smaller differences. 
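
As a small illustration of (12-3), the sketch below evaluates the Minkowski metric for several values of m. The two points are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
def minkowski(x, y, m):
    """Minkowski metric (12-3) between two equal-length sequences."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)


# Hypothetical points in p = 2 dimensions; coordinate differences are 3 and 4.
x = [0.0, 0.0]
y = [3.0, 4.0]

print(minkowski(x, y, 1))   # city-block distance: 7.0
print(minkowski(x, y, 2))   # Euclidean distance: 5.0
print(minkowski(x, y, 10))  # large m: dominated by the largest difference (4)
```

Raising m gives progressively more weight to the largest coordinate difference, so the distance moves from the sum of the differences (m = 1) toward the largest single difference.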




Two additional popular measures of “distance” or dissimilarity are given by the 
Canberra metric and the Czekanowski coefficient. Both of these measures are 
defined for nonnegative variables only. We have 


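Canberra metric: 

$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{(x_i + y_i)} \tag{12-4}$$

Czekanowski coefficient: 

$$d(\mathbf{x}, \mathbf{y}) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)} \tag{12-5}$$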


Whenever possible, it is advisable to use “true” distances—that is, distances satisfy- 
ing the distance properties of (1-25)—for clustering objects. On the other hand, 
most clustering algorithms will accept subjectively assigned distance numbers that 
may not satisfy, for example, the triangle inequality. 
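
The two measures in (12-4) and (12-5) can be sketched as follows. The data vectors are hypothetical, and skipping coordinates with x_i + y_i = 0 in the Canberra sum is an assumption made here to avoid 0/0; the text does not say how such terms should be handled.

```python
def canberra(x, y):
    """Canberra metric (12-4) for nonnegative variables."""
    # Assumption: coordinates where both values are 0 contribute nothing.
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b > 0)


def czekanowski(x, y):
    """Czekanowski coefficient (12-5) for nonnegative variables."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    total = sum(a + b for a, b in zip(x, y))
    return 1.0 - 2.0 * shared / total


# Hypothetical nonnegative observations on p = 3 variables.
x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]

print(canberra(x, y))     # 2/4 + 0/4 + 2/4 = 1.0
print(czekanowski(x, y))  # 1 - 2(1 + 2 + 1)/12 = 0.333...
```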

When items cannot be represented by meaningful p-dimensional measure- 
ments, pairs of items are often compared on the basis of the presence or absence of 
certain characteristics. Similar items have more characteristics in common than do 
dissimilar items. The presence or absence of a characteristic can be described 
mathematically by introducing a binary variable, which assumes the value 1 if the 
characteristic is present and the value 0 if the characteristic is absent. For p = 5 
binary variables, for instance, the “scores” for two items i and k might be arranged as 
follows: 

                 Variables 
           1    2    3    4    5 
Item i     1    0    0    1    1 
Item k     1    1    0    1    0 


In this case, there are two 1-1 matches, one 0-0 match, and two mismatches. 
Let $x_{ij}$ be the score (1 or 0) of the jth binary variable on the ith item and $x_{kj}$ be the 
score (again, 1 or 0) of the jth variable on the kth item, $j = 1, 2, \ldots, p$. Consequently, 

$$(x_{ij} - x_{kj})^2 = \begin{cases} 0 & \text{if } x_{ij} = x_{kj} = 1 \text{ or } x_{ij} = x_{kj} = 0 \\ 1 & \text{if } x_{ij} \ne x_{kj} \end{cases} \tag{12-6}$$

and the squared Euclidean distance, $\sum_{j=1}^{p} (x_{ij} - x_{kj})^2$, provides a count of the number 
of mismatches. A large distance corresponds to many mismatches—that is, dissimilar 
items. From the preceding display, the square of the distance between items i and 
k would be 

$$\sum_{j=1}^{5} (x_{ij} - x_{kj})^2 = (1 - 1)^2 + (0 - 1)^2 + (0 - 0)^2 + (1 - 1)^2 + (1 - 0)^2 = 2$$




Although a distance based on (12-6) might be used to measure similarity, it suf- 
fers from weighting the 1-1 and 0-0 matches equally. In some cases, a 1-1 match is a 
stronger indication of similarity than a 0-0 match. For instance, in grouping people, 
the evidence that two persons both read ancient Greek is stronger evidence of simi- 
larity than the absence of this ability. Thus, it might be reasonable to discount the 
0-0 matches or even disregard them completely. To allow for differential treatment 
of the 1-1 matches and the 0-0 matches, several schemes for defining similarity co- 
efficients have been suggested. 

To introduce these schemes, let us arrange the frequencies of matches and mis- 
matches for items i and k in the form of a contingency table: 


                     Item k 
                   1        0        Totals 
Item i     1       a        b        a + b 
           0       c        d        c + d                (12-7) 
Totals           a + c    b + d      p = a + b + c + d 


In this table, a represents the frequency of 1-1 matches, b is the frequency of 1-0 
matches, and so forth. Given the foregoing five pairs of binary outcomes, a = 2 and 
b=c=d=1. 
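
The frequencies in (12-7) are simple to tabulate from two binary score vectors. The sketch below uses the five scores of items i and k from the display above; the helper name `match_counts` is an illustrative choice.

```python
def match_counts(item_i, item_k):
    """Return (a, b, c, d): counts of 1-1, 1-0, 0-1, and 0-0 pairs, as in (12-7)."""
    pairs = list(zip(item_i, item_k))
    a = pairs.count((1, 1))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))
    return a, b, c, d


item_i = [1, 0, 0, 1, 1]
item_k = [1, 1, 0, 1, 0]

a, b, c, d = match_counts(item_i, item_k)
print(a, b, c, d)  # 2 1 1 1

# The squared Euclidean distance of (12-6) counts the mismatches, b + c.
squared_distance = sum((u - v) ** 2 for u, v in zip(item_i, item_k))
print(squared_distance)  # 2
```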

Table 12.1 lists common similarity coefficients defined in terms of the frequen- 
cies in (12-7). A short rationale follows each definition. 


Table 12.1 Similarity Coefficients for Clustering Items* 

Coefficient                         Rationale 

1. (a + d)/p                        Equal weights for 1-1 matches and 0-0 matches. 

2. 2(a + d)/[2(a + d) + b + c]      Double weight for 1-1 matches and 0-0 matches. 

3. (a + d)/[a + d + 2(b + c)]       Double weight for unmatched pairs. 

4. a/p                              No 0-0 matches in numerator. 

5. a/(a + b + c)                    No 0-0 matches in numerator or denominator. 
                                    (The 0-0 matches are treated as irrelevant.) 

6. 2a/(2a + b + c)                  No 0-0 matches in numerator or denominator. 
                                    Double weight for 1-1 matches. 

7. a/[a + 2(b + c)]                 No 0-0 matches in numerator or denominator. 
                                    Double weight for unmatched pairs. 

8. a/(b + c)                        Ratio of matches to mismatches with 0-0 matches 
                                    excluded. 

*[p binary variables; see (12-7).] 




Coefficients 1, 2, and 3 in the table are monotonically related. Suppose 
coefficient 1 is calculated for two contingency tables, Table I and Table II. Then 
if $(a_{\mathrm{I}} + d_{\mathrm{I}})/p \ge (a_{\mathrm{II}} + d_{\mathrm{II}})/p$, we also have 
$2(a_{\mathrm{I}} + d_{\mathrm{I}})/[2(a_{\mathrm{I}} + d_{\mathrm{I}}) + b_{\mathrm{I}} + c_{\mathrm{I}}] \ge 2(a_{\mathrm{II}} + d_{\mathrm{II}})/[2(a_{\mathrm{II}} + d_{\mathrm{II}}) + b_{\mathrm{II}} + c_{\mathrm{II}}]$, 
and coefficient 3 will be at least as large for Table I as it is for Table II. 
(See Exercise 12.4.) Coefficients 5, 6, and 7 also retain their relative orders. 

Monotonicity is important, because some clustering procedures are not affected 
if the definition of similarity is changed in a manner that leaves the relative orderings 
of similarities unchanged. The single linkage and complete linkage hierarchical 
procedures discussed in Section 12.3 are not affected. For these methods, any choice 
of the coefficients 1, 2, and 3 in Table 12.1 will produce the same groupings. Similarly, 
any choice of the coefficients 5, 6, and 7 will yield identical groupings. 
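
The monotone relationships can be checked by brute force. The sketch below enumerates every table (12-7) with p = 6 (a hypothetical choice of p) and confirms that coefficients 1, 2, and 3 always order pairs of tables the same way, as do coefficients 5, 6, and 7. Tables with a + b + c = 0 are left out because coefficients 5 to 7 are undefined there; that exclusion is an assumption of this sketch.

```python
from fractions import Fraction
from itertools import product


def coefficients(a, b, c, d):
    """Coefficients 1-3 and 5-7 of Table 12.1, as exact fractions."""
    p = a + b + c + d
    u = b + c  # number of unmatched pairs
    return {
        1: Fraction(a + d, p),
        2: Fraction(2 * (a + d), 2 * (a + d) + u),
        3: Fraction(a + d, a + d + 2 * u),
        5: Fraction(a, a + u),
        6: Fraction(2 * a, 2 * a + u),
        7: Fraction(a, a + 2 * u),
    }


def sign(x):
    return (x > 0) - (x < 0)


P = 6  # hypothetical number of binary variables
tables = [
    (a, b, c, P - a - b - c)
    for a, b, c in product(range(P + 1), repeat=3)
    if 0 < a + b + c <= P
]
values = [coefficients(*t) for t in tables]


def same_order(m, n):
    """True if coefficients m and n rank every pair of tables identically."""
    return all(sign(s[m] - t[m]) == sign(s[n] - t[n]) for s in values for t in values)


print(same_order(1, 2), same_order(1, 3))  # True True
print(same_order(5, 6), same_order(5, 7))  # True True
print(same_order(1, 5))                    # False
```

The last line shows that the two families are not monotonically related to each other, so switching between, say, coefficient 1 and coefficient 5 can change single linkage or complete linkage groupings.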


Example 12.1 (Calculating the values of a similarity coefficient) Suppose five indi- 
viduals possess the following characteristics: 


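               Height   Weight   Eye color   Hair color   Handedness   Gender 
Individual 1   68 in    140 lb   green       blond        right        female 
Individual 2   73 in    185 lb   brown       brown        right        male 
Individual 3   67 in    165 lb   blue        blond        right        male 
Individual 4   64 in    120 lb   brown       brown        right        female 
Individual 5   76 in    210 lb   brown       brown        left         male 

Define six binary variables $X_1, X_2, X_3, X_4, X_5, X_6$ as 

$$X_1 = \begin{cases} 1 & \text{height} \ge 72 \text{ in.} \\ 0 & \text{height} < 72 \text{ in.} \end{cases} \qquad X_4 = \begin{cases} 1 & \text{blond hair} \\ 0 & \text{not blond hair} \end{cases}$$

$$X_2 = \begin{cases} 1 & \text{weight} \ge 150 \text{ lb} \\ 0 & \text{weight} < 150 \text{ lb} \end{cases} \qquad X_5 = \begin{cases} 1 & \text{right handed} \\ 0 & \text{left handed} \end{cases}$$

$$X_3 = \begin{cases} 1 & \text{brown eyes} \\ 0 & \text{otherwise} \end{cases} \qquad X_6 = \begin{cases} 1 & \text{female} \\ 0 & \text{male} \end{cases}$$

The scores for individuals 1 and 2 on the p = 6 binary variables are 

               X1   X2   X3   X4   X5   X6 
Individual 1   0    0    0    1    1    1 
Individual 2   1    1    1    0    1    0 

and the number of matches and mismatches are indicated in the two-way array 

                      Individual 2 
                      1      0      Total 
Individual 1    1     1      2      3 
                0     3      0      3 
         Total        4      2      6 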




Employing similarity coefficient 1, which gives equal weight to matches, we 
compute 


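$$\frac{a + d}{p} = \frac{1 + 0}{6} = \frac{1}{6}$$

Continuing with similarity coefficient 1, we calculate the remaining similarity 
numbers for pairs of individuals. These are displayed in the 5 × 5 symmetric 
matrix 

$$
\begin{array}{cc|ccccc}
 & & \multicolumn{5}{c}{\text{Individual}} \\
 & & 1 & 2 & 3 & 4 & 5 \\ \hline
 & 1 & 1 & & & & \\
 & 2 & \tfrac{1}{6} & 1 & & & \\
\text{Individual} & 3 & \tfrac{4}{6} & \tfrac{3}{6} & 1 & & \\
 & 4 & \tfrac{4}{6} & \tfrac{3}{6} & \tfrac{2}{6} & 1 & \\
 & 5 & 0 & \tfrac{5}{6} & \tfrac{2}{6} & \tfrac{2}{6} & 1
\end{array}
$$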


Based on the magnitudes of the similarity coefficient, we should conclude that 
individuals 2 and 5 are most similar and individuals 1 and 5 are least similar. Other 
pairs fall between these extremes. If we were to divide the individuals into two rela- 
tively homogeneous subgroups on the basis of the similarity numbers, we might 
form the subgroups (1 3 4) and (2 5). 

Note that $X_3 = 0$ implies an absence of brown eyes, so that two people, one 
with blue eyes and one with green eyes, will yield a 0-0 match. Consequently, it may 
be inappropriate to use similarity coefficient 1, 2, or 3 because these coefficients give 
the same weights to 1-1 and 0-0 matches. ■ 
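
The calculations of Example 12.1 can be reproduced with a short script. This is a sketch rather than a prescribed procedure: the record layout and the names `binary_scores` and `coefficient_1` are illustrative, while the data and the six binary definitions are those of the example.

```python
from fractions import Fraction

# (height in inches, weight in lb, eye color, hair color, handedness, gender)
people = {
    1: (68, 140, "green", "blond", "right", "female"),
    2: (73, 185, "brown", "brown", "right", "male"),
    3: (67, 165, "blue", "blond", "right", "male"),
    4: (64, 120, "brown", "brown", "right", "female"),
    5: (76, 210, "brown", "brown", "left", "male"),
}


def binary_scores(record):
    """Scores on X1, ..., X6 as defined in Example 12.1."""
    height, weight, eye, hair, hand, gender = record
    return [
        int(height >= 72),
        int(weight >= 150),
        int(eye == "brown"),
        int(hair == "blond"),
        int(hand == "right"),
        int(gender == "female"),
    ]


def coefficient_1(u, v):
    """Similarity coefficient 1 of Table 12.1: (a + d)/p."""
    matches = sum(1 for s, t in zip(u, v) if s == t)
    return Fraction(matches, len(u))


scores = {i: binary_scores(rec) for i, rec in people.items()}
sim = {
    (i, k): coefficient_1(scores[i], scores[k])
    for i in people for k in people if i < k
}

print(sim[(1, 2)])           # 1/6
print(max(sim, key=sim.get)) # (2, 5), the most similar pair
print(min(sim, key=sim.get)) # (1, 5), the least similar pair
```

Note that `Fraction` reduces each value to lowest terms, so 4/6 prints as 2/3.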


We have described the construction of distances and similarities. It is always 
possible to construct similarities from distances. For example, we might set 

$$s_{ik} = \frac{1}{1 + d_{ik}} \tag{12-8}$$

where $0 < s_{ik} \le 1$ is the similarity between items i and k and $d_{ik}$ is the 
corresponding distance. 

However, distances that must satisfy (1-25) cannot always be constructed from 
similarities. As Gower [11,12] has shown, this can be done only if the matrix of 
similarities is nonnegative definite. With the nonnegative definite condition, and with 
the maximum similarity scaled so that $s_{ii} = 1$, 

$$d_{ik} = \sqrt{2(1 - s_{ik})} \tag{12-9}$$

has the properties of a distance. 
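
The two conversions in (12-8) and (12-9) are one-line functions. The sketch below uses hypothetical input values; note that the two maps are separate constructions and are not inverses of each other.

```python
from math import sqrt


def similarity_from_distance(d):
    """Similarity constructed from a distance, as in (12-8)."""
    return 1.0 / (1.0 + d)


def distance_from_similarity(s):
    """Distance constructed from a similarity, as in (12-9).

    Per the text, the result behaves as a true distance only when the full
    similarity matrix is nonnegative definite and s_ii = 1.
    """
    return sqrt(2.0 * (1.0 - s))


print(similarity_from_distance(0.0))  # 1.0 (identical items)
print(similarity_from_distance(3.0))  # 0.25
print(distance_from_similarity(1.0))  # 0.0
print(distance_from_similarity(0.5))  # 1.0
```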


Similarities and Association Measures for Pairs of Variables 


Thus far, we have discussed similarity measures for items. In some applications, it is 
the variables, rather than the items, that must be grouped. Similarity measures for 
variables often take the form of sample correlation coefficients. Moreover, in some 
clustering applications, negative correlations are replaced by their absolute values. 




When the variables are binary, the data can again be arranged in the form of a 
contingency table. This time, however, the variables, rather than the items, delineate 
the categories. For each pair of variables, there are n items categorized in the table. 
With the usual 0 and 1 coding, the table becomes as follows: 


                       Variable k 
                     1        0        Totals 
Variable i     1     a        b        a + b 
               0     c        d        c + d                (12-10) 
Totals             a + c    b + d      n = a + b + c + d 


For instance, variable i equals 1 and variable k equals 0 for b of the n items. 
The usual product moment correlation formula applied to the binary variables 
in the contingency table of (12-10) gives (see Exercise 12.3) 


$$r = \frac{ad - bc}{[(a + b)(c + d)(a + c)(b + d)]^{1/2}} \tag{12-11}$$


This number can be taken as a measure of the similarity between the two variables. 
The correlation coefficient in (12-11) is related to the chi-square statistic 
($r^2 = \chi^2/n$) for testing the independence of two categorical variables. For n fixed, a 
large similarity (or correlation) is consistent with the presence of dependence. 
Given the table in (12-10), measures of association (or similarity) exactly analo- 
gous to the ones listed in Table 12.1 can be developed. The only change required is 
the substitution of n (the number of items) for p (the number of variables). 
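
The relation between (12-11) and the chi-square statistic can be verified numerically. In the sketch below the cell counts are hypothetical, and `chi_square` is the usual Pearson statistic without a continuity correction, which is the version for which the identity holds.

```python
from math import sqrt


def binary_correlation(a, b, c, d):
    """Product moment correlation (12-11) for the 2 x 2 table (12-10)."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def chi_square(a, b, c, d):
    """Pearson chi-square statistic for independence (no continuity correction)."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical frequencies for n = 30 items.
a, b, c, d = 10, 5, 3, 12
n = a + b + c + d

r = binary_correlation(a, b, c, d)
print(r ** 2)                      # squared correlation
print(chi_square(a, b, c, d) / n)  # the same value, up to rounding
```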


Concluding Comments on Similarity 


To summarize this section, we note that there are many ways to measure the simi- 
larity between pairs of objects. It appears that most practitioners use distances [see 
(12-1) through (12-5)] or the coefficients in Table 12.1 to cluster items and correla- 
tions to cluster variables. However, at times, inputs to clustering algorithms may be 
simple frequencies. 


Example 12.2 (Measuring the similarities of 11 languages) The meanings of words 
change with the course of history. However, the meaning of the numbers 1, 2, 3,... 
represents one conspicuous exception. Thus, a first comparison of languages might 
be based on the numerals alone. Table 12.2 gives the first 10 numbers in English, 
Polish, Hungarian, and eight other modern European languages. (Only languages 
that use the Roman alphabet are considered, and accent marks, cedillas, diereses, 
etc., are omitted.) A cursory examination of the spelling of the numerals in the table 
suggests that the first five languages (English, Norwegian, Danish, Dutch, and Ger- 
man) are very much alike. French, Spanish, and Italian are in even closer agreement. 
Hungarian and Finnish seem to stand by themselves, and Polish has some of the 
characteristics of the languages in each of the larger subgroups. 


Table 12.2 Numerals in 11 Languages 

English  Norwegian  Danish  Dutch  German  French  Spanish  Italian  Polish    Hungarian  Finnish 
(E)      (N)        (Da)    (Du)   (G)     (Fr)    (Sp)     (I)      (P)       (H)        (Fi) 

one      en         en      een    eins    un      uno      uno      jeden     egy        yksi 
two      to         to      twee   zwei    deux    dos      due      dwa       ketto      kaksi 
three    tre        tre     drie   drei    trois   tres     tre      trzy      harom      kolme 
four     fire       fire    vier   vier    quatre  cuatro   quattro  cztery    negy       nelja 
five     fem        fem     vijf   funf    cinq    cinco    cinque   piec      ot         viisi 
six      seks       seks    zes    sechs   six     seis     sei      szesc     hat        kuusi 
seven    sju        syv     zeven  sieben  sept    siete    sette    siedem    het        seitseman 
eight    atte       otte    acht   acht    huit    ocho     otto     osiem     nyolc      kahdeksan 
nine     ni         ni      negen  neun    neuf    nueve    nove     dziewiec  kilenc     yhdeksan 
ten      ti         ti      tien   zehn    dix     diez     dieci    dziesiec  tiz        kymmenen 




Table 12.3 Concordant First Letters for Numbers in 11 Languages 

       E    N    Da   Du   G    Fr   Sp   I    P    H    Fi 
E      10 
N      8    10 
Da     8    9    10 
Du     3    5    4    10 
G      4    6    5    5    10 
Fr     4    4    4    1    3    10 
Sp     4    4    5    1    3    8    10 
I      4    4    5    1    3    9    9    10 
P      3    3    4    0    2    5    7    6    10 
H      1    2    2    2    1    0    0    0    0    10 
Fi     1    1    1    1    1    1    1    1    1    2    10 


The words for 1 in French, Spanish, and Italian all begin with u. For illustrative 
purposes, we might compare languages by looking at the first letters of the numbers. 
We call the words for the same number in two different languages concordant if they 
have the same first letter and discordant if they do not. From Table 12.2, the table of 
concordances (frequencies of matching first initials) for the numbers 1-10 is given in 
Table 12.3. We see that English and Norwegian have the same first letter for 8 of the 
10 word pairs. The remaining frequencies were calculated in the same manner. 

The results in Table 12.3 confirm our initial visual impression of Table 12.2. That 
is, English, Norwegian, Danish, Dutch, and German seem to form a group. French, 
Spanish, Italian, and Polish might be grouped together, whereas Hungarian and 
Finnish appear to stand alone. ■ 
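
The concordance counts of Table 12.3 follow mechanically from the numerals of Table 12.2. The sketch below spells the numerals with accent marks omitted, as the example specifies; the dictionary layout and the function names are illustrative choices.

```python
# Numerals 1-10 from Table 12.2, accent marks omitted as in the example.
numerals = {
    "E":  "one two three four five six seven eight nine ten",
    "N":  "en to tre fire fem seks sju atte ni ti",
    "Da": "en to tre fire fem seks syv otte ni ti",
    "Du": "een twee drie vier vijf zes zeven acht negen tien",
    "G":  "eins zwei drei vier funf sechs sieben acht neun zehn",
    "Fr": "un deux trois quatre cinq six sept huit neuf dix",
    "Sp": "uno dos tres cuatro cinco seis siete ocho nueve diez",
    "I":  "uno due tre quattro cinque sei sette otto nove dieci",
    "P":  "jeden dwa trzy cztery piec szesc siedem osiem dziewiec dziesiec",
    "H":  "egy ketto harom negy ot hat het nyolc kilenc tiz",
    "Fi": "yksi kaksi kolme nelja viisi kuusi seitseman kahdeksan yhdeksan kymmenen",
}
numerals = {lang: words.split() for lang, words in numerals.items()}
order = list(numerals)


def concordance(lang1, lang2):
    """Number of the 10 numerals whose first letters agree in the two languages."""
    return sum(w1[0] == w2[0] for w1, w2 in zip(numerals[lang1], numerals[lang2]))


def concordance_table():
    """Lower triangle of the concordance matrix, one list per language."""
    return {
        row: [concordance(row, col) for col in order[: i + 1]]
        for i, row in enumerate(order)
    }


print(concordance("E", "N"))  # 8
for lang, counts in concordance_table().items():
    print(f"{lang:3s}", *counts)
```

Subtracting each count from 10 turns these similarities into the distances used later for clustering the languages.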


In our examples so far, we have used our visual impression of similarity or dis- 
tance measures to form groups. We now discuss less subjective schemes for creating 
clusters. 


12.3 Hierarchical Clustering Methods 


We can rarely examine all grouping possibilities, even with the largest and fastest 
computers. Because of this problem, a wide variety of clustering algorithms have 
emerged that find “reasonable” clusters without having to look at all configurations. 

Hierarchical clustering techniques proceed by either a series of successive 
mergers or a series of successive divisions. Agglomerative hierarchical methods start 
with the individual objects. Thus, there are initially as many clusters as objects. The 
most similar objects are first grouped, and these initial groups are merged according 
to their similarities. Eventually, as the similarity decreases, all subgroups are fused 
into a single cluster. 

Divisive hierarchical methods work in the opposite direction. An initial single 
group of objects is divided into two subgroups such that the objects in one subgroup 
are “far from” the objects in the other. These subgroups are then further divided 
into dissimilar subgroups; the process continues until there are as many subgroups 
as objects—that is, until each object forms a group. 




The results of both agglomerative and divisive methods may be displayed in the 
form of a two-dimensional diagram known as a dendrogram. As we shall see, the 
dendrogram illustrates the mergers or divisions that have been made at successive 
levels. 

In this section we shall concentrate on agglomerative hierarchical procedures 
and, in particular, linkage methods. Excellent elementary discussions of divisive 
hierarchical procedures and other agglomerative techniques are available in [3] 
and [8]. 

Linkage methods are suitable for clustering items, as well as variables. This is 
not true for all hierarchical agglomerative procedures. We shall discuss, in turn, 
single linkage (minimum distance or nearest neighbor), complete linkage (maxi- 
mum distance or farthest neighbor), and average linkage (average distance). The 
merging of clusters under the three linkage criteria is illustrated schematically in 
Figure 12.2. 

From the figure, we see that single linkage results when groups are fused ac- 
cording to the distance between their nearest members. Complete linkage occurs 
when groups are fused according to the distance between their farthest members. 
For average linkage, groups are fused according to the average distance between 
pairs of members in the respective sets. 

The following are the steps in the agglomerative hierarchical clustering algo- 
rithm for grouping N objects (items or variables): 

1. Start with N clusters, each containing a single entity and an $N \times N$ symmetric 
matrix of distances (or similarities) $\mathbf{D} = \{d_{ik}\}$. 

2. Search the distance matrix for the nearest (most similar) pair of clusters. Let the 
distance between “most similar” clusters U and V be $d_{UV}$. 

3. Merge clusters U and V. Label the newly formed cluster (UV). Update the entries 
in the distance matrix by (a) deleting the rows and columns corresponding 
to clusters U and V and (b) adding a row and column giving the distances between 
cluster (UV) and the remaining clusters. 

4. Repeat Steps 2 and 3 a total of N − 1 times. (All objects will be in a single 
cluster after the algorithm terminates.) Record the identity of clusters that 
are merged and the levels (distances or similarities) at which the mergers take 
place. (12-12) 

Figure 12.2 Intercluster distance (dissimilarity) for (a) single linkage, (b) complete 
linkage, and (c) average linkage. [Cluster distances shown in the figure: (b) $d_{15}$; 
(c) $(d_{13} + d_{14} + d_{15} + d_{23} + d_{24} + d_{25})/6$.] 


The ideas behind any clustering procedure are probably best conveyed through 
examples, which we shall present after brief discussions of the input and algorithmic 
components of the linkage methods. 


Single Linkage 

The inputs to a single linkage algorithm can be distances or similarities between 
pairs of objects. Groups are formed from the individual entities by merging nearest 
neighbors, where the term nearest neighbor connotes the smallest distance or largest 
similarity. 

Initially, we must find the smallest distance in $\mathbf{D} = \{d_{ik}\}$ and merge the 
corresponding objects, say, U and V, to get the cluster (UV). For Step 3 of the general 
algorithm of (12-12), the distances between (UV) and any other cluster W are 
computed by 

$$d_{(UV)W} = \min\{d_{UW}, d_{VW}\} \tag{12-13}$$

Here the quantities $d_{UW}$ and $d_{VW}$ are the distances between the nearest neighbors 
of clusters U and W and clusters V and W, respectively. 

The results of single linkage clustering can be graphically displayed in the form 
of a dendrogram, or tree diagram. The branches in the tree represent clusters. The 
branches come together (merge) at nodes whose positions along a distance (or 
similarity) axis indicate the level at which the fusions occur. Dendrograms for some 
specific cases are considered in the following examples. 


Example 12.3 (Clustering using single linkage) To illustrate the single linkage 
algorithm, we consider the hypothetical distances between pairs of five objects as 


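follows: 

$$
\mathbf{D} = \{d_{ik}\} =
\begin{array}{c|ccccc}
  & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & 0 & & & & \\
2 & 9 & 0 & & & \\
3 & 3 & 7 & 0 & & \\
4 & 6 & 5 & 9 & 0 & \\
5 & 11 & 10 & 2 & 8 & 0
\end{array}
$$

Treating each object as a cluster, we commence clustering by merging the two 
closest items. Since 

$$\min_{i,k}(d_{ik}) = d_{53} = 2$$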




objects 5 and 3 are merged to form the cluster (35). To implement the next level of 
clustering, we need the distances between the cluster (35) and the remaining objects, 
1, 2, and 4. The nearest neighbor distances are 

$$d_{(35)1} = \min\{d_{31}, d_{51}\} = \min\{3, 11\} = 3$$

$$d_{(35)2} = \min\{d_{32}, d_{52}\} = \min\{7, 10\} = 7$$

$$d_{(35)4} = \min\{d_{34}, d_{54}\} = \min\{9, 8\} = 8$$

Deleting the rows and columns of D corresponding to objects 3 and 5, and adding a 
row and column for the cluster (35), we obtain the new distance matrix 

$$
\begin{array}{c|cccc}
     & (35) & 1 & 2 & 4 \\ \hline
(35) & 0 & & & \\
1    & 3 & 0 & & \\
2    & 7 & 9 & 0 & \\
4    & 8 & 6 & 5 & 0
\end{array}
$$

The smallest distance between pairs of clusters is now $d_{(35)1} = 3$, and we merge 
cluster (1) with cluster (35) to get the next cluster, (135). Calculating 

$$d_{(135)2} = \min\{d_{(35)2}, d_{12}\} = \min\{7, 9\} = 7$$

$$d_{(135)4} = \min\{d_{(35)4}, d_{14}\} = \min\{8, 6\} = 6$$

we find that the distance matrix for the next level of clustering is 

$$
\begin{array}{c|ccc}
      & (135) & 2 & 4 \\ \hline
(135) & 0 & & \\
2     & 7 & 0 & \\
4     & 6 & 5 & 0
\end{array}
$$

The minimum nearest neighbor distance between pairs of clusters is $d_{42} = 5$, and we 
merge objects 4 and 2 to get the cluster (24). 

At this point we have two distinct clusters, (135) and (24). Their nearest neighbor 
distance is 

$$d_{(135)(24)} = \min\{d_{(135)2}, d_{(135)4}\} = \min\{7, 6\} = 6$$

The final distance matrix becomes 

$$
\begin{array}{c|cc}
      & (135) & (24) \\ \hline
(135) & 0 & \\
(24)  & 6 & 0
\end{array}
$$


Consequently, clusters (135) and (24) are merged to form a single cluster of all five 
objects, (12345), when the nearest neighbor distance reaches 6. 

The dendrogram picturing the hierarchical clustering just concluded is shown in 
Figure 12.3. The groupings and the distance levels at which they occur are clearly 
illustrated by the dendrogram. ■ 
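
The steps of (12-12), with the single linkage update (12-13), can be sketched in a few lines and applied to the distances of Example 12.3. This is an illustrative implementation, not an optimized one; the function name `agglomerate` and the `update` parameter are choices made here, and ties are broken arbitrarily by whichever minimum is found first.

```python
def agglomerate(D, labels, update=min):
    """Agglomerative clustering following Steps 1-4 of (12-12).

    D is a full symmetric distance matrix (list of lists). `update` combines
    d_UW and d_VW into the distance from (UV) to W; min gives single linkage.
    Returns a list of (merged cluster, level) pairs.
    """
    # Step 1: every object starts as its own cluster.
    clusters = [(lab,) for lab in labels]
    dist = {
        frozenset((clusters[i], clusters[k])): D[i][k]
        for i in range(len(labels)) for k in range(i)
    }
    merges = []
    while len(clusters) > 1:
        # Step 2: find the nearest pair of clusters U and V.
        pair = min(dist, key=dist.get)
        u, v = tuple(pair)
        level = dist[pair]
        # Step 3: merge U and V, then update the distance table.
        uv = tuple(sorted(u + v))
        rest = [w for w in clusters if w not in (u, v)]
        new = {
            frozenset((uv, w)): update(dist[frozenset((u, w))], dist[frozenset((v, w))])
            for w in rest
        }
        dist = {key: val for key, val in dist.items() if u not in key and v not in key}
        dist.update(new)
        clusters = rest + [uv]
        # Step 4: record the merger and its level, then repeat.
        merges.append((uv, level))
    return merges


# Distances between the five objects of Example 12.3.
D = [
    [0, 9, 3, 6, 11],
    [9, 0, 7, 5, 10],
    [3, 7, 0, 9, 2],
    [6, 5, 9, 0, 8],
    [11, 10, 2, 8, 0],
]

merges = agglomerate(D, labels=[1, 2, 3, 4, 5], update=min)
for cluster, level in merges:
    print(cluster, level)
# (3, 5) 2
# (1, 3, 5) 3
# (2, 4) 5
# (1, 2, 3, 4, 5) 6
```

The recorded levels 2, 3, 5, and 6 are the heights at which the branches join in the dendrogram of Figure 12.3.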


In typical applications of hierarchical clustering, the intermediate results— 
where the objects are sorted into a moderate number of clusters—are of chief 
interest. 




Figure 12.3 Single linkage dendrogram for distances between five objects. 
[Vertical axis: distance; horizontal axis: objects.] 


Example 12.4 (Single linkage clustering of 11 languages) Consider the array of 
concordances in Table 12.3 representing the closeness between the numbers 1-10 in 11 
languages. To develop a matrix of distances, we subtract the concordances from the 
perfect agreement figure of 10 that each language has with itself. The subsequent 
assignments of distances are 


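$$
\begin{array}{c|ccccccccccc}
   & E & N & Da & Du & G & Fr & Sp & I & P & H & Fi \\ \hline
E  & 0 \\
N  & 2 & 0 \\
Da & 2 & 1 & 0 \\
Du & 7 & 5 & 6 & 0 \\
G  & 6 & 4 & 5 & 5 & 0 \\
Fr & 6 & 6 & 6 & 9 & 7 & 0 \\
Sp & 6 & 6 & 5 & 9 & 7 & 2 & 0 \\
I  & 6 & 6 & 5 & 9 & 7 & 1 & 1 & 0 \\
P  & 7 & 7 & 6 & 10 & 8 & 5 & 3 & 4 & 0 \\
H  & 9 & 8 & 8 & 8 & 9 & 10 & 10 & 10 & 10 & 0 \\
Fi & 9 & 9 & 9 & 9 & 9 & 9 & 9 & 9 & 9 & 8 & 0
\end{array}
$$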


We first search for the minimum distance between pairs of languages (clusters). 
The minimum distance, 1, occurs between Danish and Norwegian, Italian and 
French, and Italian and Spanish. Numbering the languages in the order in which 
they appear across the top of the array, we have 


$$d_{32} = 1; \quad d_{86} = 1; \quad \text{and } d_{87} = 1$$


Since $d_{76} = 2$, we can merge only clusters 8 and 6 or clusters 8 and 7. We cannot 
merge clusters 6, 7, and 8 at level 1. We choose first to merge 6 and 8, and then to 
update the distance matrix and merge 2 and 3 to obtain the clusters (68) and (23). 
Subsequent computer calculations produce the dendrogram in Figure 12.4. 

From the dendrogram, we see that Norwegian and Danish, and also French and 
Italian, cluster at the minimum distance (maximum similarity) level. When the 
allowable distance is increased, English is added to the Norwegian–Danish group, 
and Spanish merges with the French–Italian group. Notice that Hungarian and 
Finnish are more similar to each other than to the other clusters of languages. 
However, these two clusters (languages) do not merge until the distance between nearest 
neighbors has increased substantially. Finally, all the clusters of languages are 
merged into a single cluster at the largest nearest neighbor distance, 9. ■ 

Figure 12.4 Single linkage dendrograms for distances between numbers in 11 languages. 
[Vertical axis: distance; languages in order along the horizontal axis: E, N, Da, Fr, I, Sp, P, Du, G, H, Fi.] 


Since single linkage joins clusters by the shortest link between them, the tech- 
nique cannot discern poorly separated clusters. [See Figure 12.5(a).] On the other 
hand, single linkage is one of the few clustering methods that can delineate nonel- 
lipsoidal clusters. The tendency of single linkage to pick out long stringlike clusters 
is known as chaining. [See Figure 12.5(b).] Chaining can be misleading if items at 
opposite ends of the chain are, in fact, quite dissimilar. 


Figure 12.5 Single linkage clusters: (a) single linkage confused by near overlap 
(elliptical configurations); (b) chaining effect (nonelliptical configurations). 
[Axes in both panels: Variable 1 and Variable 2.] 


The clusters formed by the single linkage method will be unchanged by any as- 
signment of distance (similarity) that gives the same relative orderings as the initial 
distances (similarities). In particular, any one of a set of similarity coefficients from 
Table 12.1 that are monotonic to one another will produce the same clustering. 


Complete Linkage 


Complete linkage clustering proceeds in much the same manner as single linkage 
clustering, with one important exception: At each stage, the distance (similarity) 
between clusters is determined by the distance (similarity) between the two 
elements, one from each cluster, that are most distant. Thus, complete linkage 
ensures that all items in a cluster are within some maximum distance (or minimum 
similarity) of each other. 

The general agglomerative algorithm again starts by finding the minimum entry 
in $\mathbf{D} = \{d_{ik}\}$ and merging the corresponding objects, such as U and V, to get cluster 
(UV). For Step 3 of the general algorithm in (12-12), the distances between (UV) 
and any other cluster W are computed by 

$$d_{(UV)W} = \max\{d_{UW}, d_{VW}\} \tag{12-14}$$

Here $d_{UW}$ and $d_{VW}$ are the distances between the most distant members of clusters 
U and W and clusters V and W, respectively. 


Example 12.5 (Clustering using complete linkage) Let us return to the distance 
matrix introduced in Example 12.3: 


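$$
\mathbf{D} = \{d_{ik}\} =
\begin{array}{c|ccccc}
  & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & 0 & & & & \\
2 & 9 & 0 & & & \\
3 & 3 & 7 & 0 & & \\
4 & 6 & 5 & 9 & 0 & \\
5 & 11 & 10 & 2 & 8 & 0
\end{array}
$$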


At the first stage, objects 3 and 5 are merged, since they are most similar. This gives 
the cluster (35). At stage 2, we compute 

$$d_{(35)1} = \max\{d_{31}, d_{51}\} = \max\{3, 11\} = 11$$

$$d_{(35)2} = \max\{d_{32}, d_{52}\} = 10$$

$$d_{(35)4} = \max\{d_{34}, d_{54}\} = 9$$

and the modified distance matrix becomes 

$$
\begin{array}{c|cccc}
     & (35) & 1 & 2 & 4 \\ \hline
(35) & 0 & & & \\
1    & 11 & 0 & & \\
2    & 10 & 9 & 0 & \\
4    & 9 & 6 & 5 & 0
\end{array}
$$

The next merger occurs between the most similar groups, 2 and 4, to give the cluster 
(24). At stage 3, we have 

$$d_{(24)(35)} = \max\{d_{2(35)}, d_{4(35)}\} = \max\{10, 9\} = 10$$

$$d_{(24)1} = \max\{d_{21}, d_{41}\} = 9$$

and the distance matrix 

$$
\begin{array}{c|ccc}
     & (35) & (24) & 1 \\ \hline
(35) & 0 & & \\
(24) & 10 & 0 & \\
1    & 11 & 9 & 0
\end{array}
$$


Figure 12.6 Complete linkage dendrogram for distances between five objects. 
[Vertical axis: distance; horizontal axis: objects.] 


The next merger produces the cluster (124). At the final stage, the groups (35) and 
(124) are merged as the single cluster (12345) at level 

$$d_{(124)(35)} = \max\{d_{1(35)}, d_{(24)(35)}\} = \max\{11, 10\} = 11$$

The dendrogram is given in Figure 12.6. ■ 
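
Complete linkage can also be computed straight from its definition, taking the distance between two clusters to be the largest distance between a member of one and a member of the other, which gives the same values as the update rule (12-14). The sketch below applies this to the distances of Example 12.5; the names `pair_distance`, `farthest`, and `complete_linkage` are illustrative.

```python
from itertools import combinations

# Pairwise distances between the five objects (Examples 12.3 and 12.5).
pairs = {
    (1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
    (2, 3): 7, (2, 4): 5, (2, 5): 10,
    (3, 4): 9, (3, 5): 2,
    (4, 5): 8,
}


def pair_distance(i, k):
    return pairs[(min(i, k), max(i, k))]


def farthest(cluster_pair):
    """Complete linkage distance: largest distance between members of U and V."""
    U, V = cluster_pair
    return max(pair_distance(i, k) for i in U for k in V)


def complete_linkage(objects):
    """Return the merge history as (sorted members, level) pairs."""
    clusters = [frozenset([obj]) for obj in objects]
    history = []
    while len(clusters) > 1:
        # Merge the two clusters whose most distant members are closest.
        U, V = min(combinations(clusters, 2), key=farthest)
        level = farthest((U, V))
        clusters = [W for W in clusters if W not in (U, V)] + [U | V]
        history.append((sorted(U | V), level))
    return history


history = complete_linkage([1, 2, 3, 4, 5])
for members, level in history:
    print(members, level)
# [3, 5] 2
# [2, 4] 5
# [1, 2, 4] 9
# [1, 2, 3, 4, 5] 11
```

Compared with single linkage on the same distances, object 1 now joins the cluster (24) rather than (35), and the final merger occurs at level 11 instead of 6.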


Comparing Figures 12.3 and 12.6, we see that the dendrograms for single link- 
age and complete linkage differ in the allocation of object 1 to previous groups. 


Example 12.6 (Complete linkage clustering of 11 languages) In Example 12.4, we 
presented a distance matrix for numbers in 11 languages. The complete linkage clus- 
tering algorithm applied to this distance matrix produces the dendrogram shown in 
Figure 12.7. : 

Comparing Figures 12.7 and 12.4, we see that both hierarchical methods yield the 
English-Norwegian-Danish and the French—Italian-Spanish language groups. Polish is 
merged with French-Italian-Spanish at an intermediate level. In addition, both meth- 
ods merge Hungarian and Finnish only at the penultimate stage. 

However, the two methods handle German and Dutch differently. Single linkage
merges German and Dutch at an intermediate distance, and these two languages
remain a cluster until the final merger. Complete linkage merges German
with the English-Norwegian-Danish group at an intermediate level. Dutch remains
a cluster by itself until it is merged with the English-Norwegian-Danish-German
and French-Italian-Spanish-Polish groups at a higher distance level. The final
complete linkage merger involves two clusters. The final merger in single linkage
involves three clusters.

Figure 12.7 Complete linkage dendrogram for distances between numbers in 11 languages.
(Languages, left to right: E, N, Da, G, Fr, I, Sp, P, Du, H, Fi.)


Example 12.7 (Clustering variables using complete linkage) Data collected on 22
U.S. public utility companies for the year 1975 are listed in Table 12.4. Although it is
more interesting to group companies, we shall see here how the complete linkage
algorithm can be used to cluster variables. We measure the similarity between pairs of
variables by the product-moment correlation coefficient. The correlation matrix is
given in Table 12.5.


Table 12.4 Public Utility Data (1975)

                                                        Variables
Company                                X1     X2    X3     X4     X5      X6     X7      X8
 1. Arizona Public Service
 2. Boston Edison Co.
 3. Central Louisiana Electric Co.    1.43   15.4   113   53.0    3.4    9212    0.    1.058
 4. Commonwealth Edison Co.           1.02   11.2   168   56.0     .3    6423   34.3    .700
 5. Consolidated Edison Co. (N.Y.)    1.49    8.8   192   51.2    1.0    3300   15.6   2.044
 6. Florida Power & Light Co.         1.32   13.5   111   60.0   -2.2   11127   22.5   1.241
 7. Hawaiian Electric Co.             1.22   12.2   175   67.6    2.2    7642    0.    1.652
 8. Idaho Power Co.                   1.10    9.2   245   57.0    3.3   13082    0.     .309
 9. Kentucky Utilities Co.            1.34   13.0   168   60.4    7.2    8406    0.     .862
10. Madison Gas & Electric Co.        1.12   12.4   197   53.0    2.7    6455   39.2    .623
11. Nevada Power Co.                   .75    7.5   173   51.5    6.5   17441    0.     .768
12. New England Electric Co.          1.13   10.9   178   62.0    3.7    6154    0.    1.897
13. Northern States Power Co.         1.15   12.7   199   53.7    6.4    7179   50.2    .527
14. Oklahoma Gas & Electric Co.       1.09   12.0    96   49.8    1.4    9673    0.
15. Pacific Gas & Electric Co.         .96    7.6   164   62.2    -.1    6468
16. Puget Sound Power & Light Co.     1.16    9.9   252   56.0    9.2   15991
17. San Diego Gas & Electric Co.       .76    6.4   136   61.9    9.0    5714
18. The Southern Co.                  1.05   12.6   150   56.7    2.7   10140
19. Texas Utilities Co.               1.16   11.7   104   54.0   -2.1   13507
20. Wisconsin Electric Power Co.      1.20   11.8   148   59.9    3.5    7287
21. United Illuminating Co.           1.04    8.6   204   61.0    3.5    6650
22. Virginia Electric & Power Co.     1.07    9.3   174   54.3    5.9   10093

Key: X1: Fixed-charge coverage ratio (income/debt).
     X2: Rate of return on capital.
     X3: Cost per kW capacity in place.
     X4: Annual load factor.
     X5: Peak kWh demand growth from 1974 to 1975.
     X6: Sales (kWh use per year).
     X7: Percent nuclear.
     X8: Total fuel costs (cents per kWh).
Source: Data courtesy of H. E. Thompson.


Table 12.5 Correlations Between Pairs of Variables (Public Utility Data)

          X1      X2      X3      X4      X5      X6      X7      X8
X1     1.000
X2      .643   1.000
X3     -.103   -.348   1.000
X4     -.082   -.086    .100   1.000
X5     -.259   -.260    .435    .034   1.000
X6     -.152   -.010    .028   -.288    .176   1.000
X7      .045    .211    .115   -.164   -.019   -.374   1.000
X8     -.013   -.328    .005    .486   -.007   -.561   -.185   1.000

When the sample correlations are used as similarity measures, variables with 
large negative correlations are regarded as very dissimilar; variables with large pos- 
itive correlations are regarded as very similar. In this case, the “distance” between 
clusters is measured as the smallest similarity between members of the correspond- 
ing clusters. The complete linkage algorithm, applied to the foregoing similarity ma- 
trix, yields the dendrogram in Figure 12.8. 

We see that variables 1 and 2 (fixed-charge coverage ratio and rate of return on 
capital), variables 4 and 8 (annual load factor and total fuel costs), and variables 3 
and 5 (cost per kilowatt capacity in place and peak kilowatthour demand growth) 
cluster at intermediate “similarity” levels. Variables 7 (percent nuclear) and 6 (sales) 
remain by themselves until the final stages. The final merger brings together the 
(12478) group and the (356) group.
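
The merger sequence described above can be checked directly from the correlations in Table 12.5. The sketch below is illustrative rather than the authors' software: it treats each correlation as a similarity, takes the similarity between two clusters to be the smallest pairwise correlation, and at each step joins the two clusters with the largest such value. The variable names `sim`, `cluster_similarity`, and `history` are assumptions made for this example.

```python
from itertools import combinations

# Lower triangle of the correlation matrix in Table 12.5 (variables 1 to 8).
lower = [
    [1.000],
    [.643, 1.000],
    [-.103, -.348, 1.000],
    [-.082, -.086, .100, 1.000],
    [-.259, -.260, .435, .034, 1.000],
    [-.152, -.010, .028, -.288, .176, 1.000],
    [.045, .211, .115, -.164, -.019, -.374, 1.000],
    [-.013, -.328, .005, .486, -.007, -.561, -.185, 1.000],
]
sim = {frozenset([i + 1, k + 1]): r
       for i, row in enumerate(lower) for k, r in enumerate(row) if i != k}

def cluster_similarity(a, b):
    # Complete linkage with similarities: the smallest pairwise similarity.
    return min(sim[frozenset([i, k])] for i in a for k in b)

clusters = [frozenset([v]) for v in range(1, 9)]
history = []
while len(clusters) > 1:
    # Join the two clusters that are most similar under the rule above.
    a, b = max(combinations(clusters, 2), key=lambda p: cluster_similarity(*p))
    history.append((sorted(a), sorted(b), cluster_similarity(a, b)))
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]
```

The first three entries of `history` are the pairs (1, 2), (4, 8), and (3, 5), and the last entry joins the (12478) and (356) groups.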


As in single linkage, a “new” assignment of distances (similarities) that have the 
same relative orderings as the initial distances will not change the configuration of 
the complete linkage clusters. 


Figure 12.8 Complete linkage dendrogram for similarities among eight utility
company variables. (Variables, left to right: 1, 2, 7, 4, 8, 3, 5, 6.)


Average Linkage 


Average linkage treats the distance between two clusters as the average distance 
between all pairs of items where one member of a pair belongs to each cluster. 

Again, the input to the average linkage algorithm may be distances or similarities,
and the method can be used to group objects or variables. The average linkage
algorithm proceeds in the manner of the general algorithm of (12-12). We begin by
searching the distance matrix $\mathbf{D} = \{d_{ik}\}$ to find the nearest (most similar)
objects, for example, U and V. These objects are merged to form the cluster (UV). For Step
3 of the general agglomerative algorithm, the distances between (UV) and the other
cluster W are determined by

$$d_{(UV)W} = \frac{\sum_i \sum_k d_{ik}}{N_{(UV)} N_W} \qquad (12\text{-}15)$$

where $d_{ik}$ is the distance between object i in the cluster (UV) and object k in the
cluster W, and $N_{(UV)}$ and $N_W$ are the number of items in clusters (UV) and W,
respectively.
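
As a small illustration of (12-15), the function below averages all pairwise distances between two clusters. It is a sketch only; the name `average_linkage` and the dictionary layout are assumptions, and the distances are those of Example 12.3 with objects 3 and 5 already merged.

```python
def average_linkage(dist, cluster_uv, cluster_w):
    """Average of all pairwise distances between two clusters, as in (12-15)."""
    total = sum(dist[frozenset([i, k])] for i in cluster_uv for k in cluster_w)
    return total / (len(cluster_uv) * len(cluster_w))

# Distance matrix of Example 12.3 (lower triangle, rows 2 to 5).
rows = {2: [9], 3: [3, 7], 4: [6, 5, 9], 5: [11, 10, 2, 8]}
dist = {frozenset([i, k + 1]): d for i, r in rows.items() for k, d in enumerate(r)}

d_35_1 = average_linkage(dist, {3, 5}, {1})   # (3 + 11) / 2
d_35_2 = average_linkage(dist, {3, 5}, {2})   # (7 + 10) / 2
d_35_4 = average_linkage(dist, {3, 5}, {4})   # (9 + 8) / 2
```

Compared with complete linkage, which gave 11, 10, and 9 for the same three distances, the averages are smaller and the distances from (35) to objects 2 and 4 are tied.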


Example 12.8 (Average linkage clustering of 11 languages) The average linkage
algorithm was applied to the "distances" between 11 languages given in Example 12.4.
The resulting dendrogram is displayed in Figure 12.9.


Figure 12.9 Average linkage dendrogram for distances between numbers in 11 languages.


A comparison of the dendrogram in Figure 12.9 with the corresponding single 
linkage dendrogram (Figure 12.4) and complete linkage dendrogram (Figure 12.7) 
indicates that average linkage yields a configuration very much like the complete 
linkage configuration. However, because distance is defined differently for each 
case, it is not surprising that mergers take place at different levels.


Example 12.9 (Average linkage clustering of public utilities) An average linkage 
algorithm applied to the Euclidean distances between 22 public utilities (see 
Table 12.6) produced the dendrogram in Figure 12.10 on page 692. 


Table 12.6 Distances Between 22 Utilities
(Lower-triangular matrix of pairwise distances, with rows and columns indexed by
firm no. 1 to 22.)


Figure 12.10 Average linkage dendrogram for distances between 22 public utility
companies. (Public utility companies, left to right: 1, 18, 19, 14, 9, 3, 6, 22, 10,
13, 20, 4, 7, 12, 21, 15, 2, 11, 16, 8, 5, 17.)


Concentrating on the intermediate clusters, we see that the utility companies 
tend to group according to geographical location. For example, one intermediate 
cluster contains the firms 1 (Arizona Public Service), 18 (The Southern Company— 
primarily Georgia and Alabama), 19 (Texas Utilities Company), and 14 (Oklahoma 
Gas and Electric Company). There are some exceptions. The cluster (7, 12, 21, 15, 2)
contains firms on the eastern seaboard and in the far west. On the other hand, all 
these firms are located near the coasts. Notice that Consolidated Edison Company 
of New York and San Diego Gas and Electric Company stand by themselves until 
the final amalgamation stages. 

It is, perhaps, not surprising that utility firms with similar locations (or types of 
locations) cluster. One would expect regulated firms in the same area to use, basi- 
cally, the same type of fuel(s) for power plants and face common markets. Conse- 
quently, types of generation, costs, growth rates, and so forth should be relatively 
homogeneous among these firms. This is apparently reflected in the hierarchical 
clustering.


For average linkage clustering, changes in the assignment of distances (similari- 
ties) can affect the arrangement of the final configuration of clusters, even though 
the changes preserve relative orderings. 
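
A small numerical experiment makes the contrast concrete. In the sketch below, the six distances among objects A, B, C, and D are hypothetical values chosen for illustration, and `merge_order` is an illustrative helper. Squaring every distance preserves their relative ordering, yet it changes the average linkage configuration while leaving the complete linkage configuration alone.

```python
from itertools import combinations

def merge_order(dist, linkage):
    """Return the sequence of merged clusters under a given linkage rule."""
    items = sorted({i for pair in dist for i in pair})
    clusters = [frozenset([i]) for i in items]
    order = []

    def between(a, b):
        return linkage([dist[frozenset([i, k])] for i in a for k in b])

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: between(*p))
        order.append(a | b)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return order

def mean(values):
    return sum(values) / len(values)

# Hypothetical distances among four objects, chosen for illustration.
dist = {frozenset(p): d for p, d in {
    ("A", "B"): 1.0, ("A", "C"): 2.0, ("B", "C"): 8.0,
    ("C", "D"): 5.5, ("A", "D"): 9.0, ("B", "D"): 10.0}.items()}
squared = {pair: d ** 2 for pair, d in dist.items()}  # same relative ordering

complete_same = merge_order(dist, max) == merge_order(squared, max)
average_same = merge_order(dist, mean) == merge_order(squared, mean)
```

With the original distances, average linkage joins C to (AB) at the second step, since (2 + 8)/2 = 5 is less than 5.5. With the squared distances it joins C and D instead, since 30.25 is less than (4 + 64)/2 = 34.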


Ward’s Hierarchical Clustering Method 


Ward [32] considered hierarchical clustering procedures based on minimizing the
'loss of information' from joining two groups. This method is usually implemented
with loss of information taken to be an increase in an error sum of squares criterion,
ESS. First, for a given cluster k, let $ESS_k$ be the sum of the squared deviations of
every item in the cluster from the cluster mean (centroid). If there are currently K
clusters, define ESS as the sum of the $ESS_k$, or $ESS = ESS_1 + ESS_2 + \cdots + ESS_K$.
At each step in the analysis, the union of every possible pair of clusters is considered,
and the two clusters whose combination results in the smallest increase in ESS
(minimum loss of information) are joined. Initially, each cluster consists of a single item,
and, if there are N items, $ESS_k = 0$, $k = 1, 2, \ldots, N$, so ESS = 0. At the other
extreme, when all the clusters are combined in a single group of N items, the value of
ESS is given by

$$ESS = \sum_{j=1}^{N} (\mathbf{x}_j - \bar{\mathbf{x}})'(\mathbf{x}_j - \bar{\mathbf{x}})$$

where $\mathbf{x}_j$ is the multivariate measurement associated with the jth item and
$\bar{\mathbf{x}}$ is the mean of all the items.

The results of Ward’s method can be displayed as a dendrogram. The vertical 
axis gives the values of ESS at which the mergers occur. 

Ward’s method is based on the notion that the clusters of multivariate observa- 
tions are expected to be roughly elliptically shaped. It is a hierarchical precursor to 
nonhierarchical clustering methods that optimize some criterion for dividing data 
into a given number of elliptical groups. We discuss nonhierarchical clustering pro- 
cedures in the next section. Additional discussion of optimization methods of cluster 
analysis is contained in [8]. 
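
The merging rule of Ward's method can be sketched in a few lines. The code below is a minimal illustration, not an efficient implementation: it recomputes ESS from scratch for every candidate pair, the function names `ess` and `ward` are assumptions, and the four two-dimensional items are hypothetical.

```python
from itertools import combinations

def ess(cluster):
    """Sum of squared deviations of the items from the cluster centroid."""
    n, p = len(cluster), len(cluster[0])
    centroid = [sum(x[i] for x in cluster) / n for i in range(p)]
    return sum((x[i] - centroid[i]) ** 2 for x in cluster for i in range(p))

def ward(points):
    """At each step, merge the pair of clusters giving the smallest ESS increase."""
    clusters = [[tuple(x)] for x in points]
    steps = []
    while len(clusters) > 1:
        def increase(pair):
            a, b = pair
            return ess(clusters[a] + clusters[b]) - ess(clusters[a]) - ess(clusters[b])
        a, b = min(combinations(range(len(clusters)), 2), key=increase)
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
        # Record the merged cluster and the total ESS after the merger.
        steps.append((merged, sum(ess(c) for c in clusters)))
    return steps

# Hypothetical two-dimensional items, for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
steps = ward(points)
```

The second element of each entry in `steps` is the value of ESS after that merger, which is the quantity plotted on the vertical axis of the dendrogram. The last value equals the single-group ESS given by the formula above.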


Example 12.10 (Clustering pure malt scotch whiskies) Virtually all the world's pure
malt Scotch whiskies are produced in Scotland. In one study (see [22]), 68 binary
variables were created measuring characteristics of Scotch whiskey that can be
broadly classified as color, nose, body, palate, and finish. For example, there were
14 color characteristics (descriptions), including white wine, yellow, very pale, pale,
bronze, full amber, red, and so forth. LaPointe and Legendre clustered 109 pure malt
Scotch whiskies, each from a different distillery. The investigators were interested in
determining the major types of single-malt whiskies, their chief characteristics, and
the best representative. In addition, they wanted to know whether the groups
produced by the hierarchical clustering procedure corresponded to different geographical
regions, since it is known that whiskies are affected by local soil, temperature,
and water conditions.

Weighted similarity coefficients $\{s_{ik}\}$ were created from binary variables
representing the presence or absence of characteristics. The resulting "distances," defined
as $\{d_{ik} = 1 - s_{ik}\}$, were used with Ward's method to group the 109 pure (single-)
malt Scotch whiskies. The resulting dendrogram is shown in Figure 12.11. (An average
linkage procedure applied to a similarity matrix produced almost exactly the
same classification.)

The groups labelled A-L in the figure are the 12 groups of similar Scotches
identified by the investigators. A follow-up analysis suggested that these 12
groups have a large geographic component in the sense that Scotches with similar
characteristics tend to be produced by distilleries that are located reasonably
close to one another. Consequently, the investigators concluded, "The relationship
with geographic features was demonstrated, supporting the hypothesis that
whiskies are affected not only by distillery secrets and traditions but also by
factors dependent on region such as water, soil, microclimate, temperature and even
air quality."

Figure 12.11 A dendrogram for similarities between 109 pure malt Scotch whiskies.
(The leaves are the 109 distilleries, arranged in the groups A-L. The upper scale gives
the number of groups: 2, 3, 6, and 12.)




Final Comments—Hierarchical Procedures 


There are many agglomerative hierarchical clustering procedures besides single 
linkage, complete linkage, and average linkage. However, all the agglomerative pro- 
cedures follow the basic algorithm of (12-12). 

As with most clustering methods, sources of error and variation are not formal- 
ly considered in hierarchical procedures. This means that a clustering method will be 
sensitive to outliers, or “noise points.” 

In hierarchical clustering, there is no provision for a reallocation of objects that 
may have been “incorrectly” grouped at an early stage. Consequently, the final 
configuration of clusters should always be carefully examined to see whether it is 
sensible. 

For a particular problem, it is a good idea to try several clustering methods and,
within a given method, a couple of different ways of assigning distances (similarities).
If the outcomes from the several methods are (roughly) consistent with one another,
perhaps a case for "natural" groupings can be advanced.

The stability of a hierarchical solution can sometimes be checked by applying
the clustering algorithm before and after small errors (perturbations) have been
added to the data units. If the groups are fairly well distinguished, the clusterings
before perturbation and after perturbation should agree.

Common values (ties) in the similarity or distance matrix can produce multiple
solutions to a hierarchical clustering problem. That is, the dendrograms
corresponding to different treatments of the tied similarities (distances) can be
different, particularly at the lower levels. This is not an inherent problem of any
method; rather, multiple solutions occur for certain kinds of data. Multiple
solutions are not necessarily bad, but the user needs to know of their existence so that
the groupings (dendrograms) can be properly interpreted and different groupings
(dendrograms) compared to assess their overlap. A further discussion of this issue
appears in [27].

Some data sets and hierarchical clustering methods can produce inversions. 
(See [27].) An inversion occurs when an object joins an existing cluster at a smaller 
distance (greater similarity) than that of a previous consolidation. An inversion is 
represented two different ways in the following diagram: 


[Diagram: two dendrograms for objects A, B, C, and D, with mergers at distances 20, 32,
and 30. (i) Dendrogram with crossover. (ii) Dendrogram with a nonmonotonic scale.]


In this example, the clustering method joins A and B at distance 20. At the next 
step, C is added to the group (AB) at distance 32. Because of the nature of the clus- 
tering algorithm, D is added to group (ABC) at distance 30, a smaller distance than 
the distance at which C joined (AB). In (i) the inversion is indicated by a dendro- 
gram with crossover. In (ii), the inversion is indicated by a dendrogram with a non- 
monotonic scale. 

Inversions can occur when there is no clear cluster structure and are generally 
associated with two hierarchical clustering algorithms known as the centroid 
method and the median method. The hierarchical procedures discussed in this book 
are not prone to inversions. 
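
A three-point configuration is enough to produce an inversion under the centroid method. The sketch below uses hypothetical coordinates forming a nearly equilateral triangle, and it assumes the centroid method measures the distance between clusters as the Euclidean distance between their centroids.

```python
import math

def centroid(points):
    n = len(points)
    return tuple(sum(x[i] for x in points) / n for i in range(len(points[0])))

# Hypothetical items forming a nearly equilateral triangle.
A, B, C = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

first_merge = math.dist(A, B)                   # A and B are the closest pair
second_merge = math.dist(centroid([A, B]), C)   # centroid of (AB) to C
inversion = second_merge < first_merge
```

A and B join at distance 2, but C then joins (AB) at distance 1.8, a smaller value than that of the previous consolidation.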


12.4 Nonhierarchical Clustering Methods 


Nonhierarchical clustering techniques are designed to group items, rather than vari- 
ables, into a collection of K clusters. The number of clusters, K, may either be speci- 
fied in advance or determined as part of the clustering procedure. Because a matrix 
of distances (similarities) does not have to be determined, and the basic data do not 
have to be stored during the computer run, nonhierarchical methods can be applied 
to much larger data sets than can hierarchical techniques. 

Nonhierarchical methods start from either (1) an initial partition of items into 
groups or (2) an initial set of seed points, which will form the nuclei of clusters. 
Good choices for starting configurations should be free of overt biases. One way to 
start is to randomly select seed points from among the items or to randomly parti- 
tion the items into initial groups. 

In this section, we discuss one of the more popular nonhierarchical procedures, 
the K-means method. 


K-means Method 


MacQueen [25] suggests the term K-means for describing an algorithm of his that 
assigns each item to the cluster having the nearest centroid (mean). In its simplest 
version, the process is composed of these three steps: 


1. Partition the items into K initial clusters. 

2. Proceed through the list of items, assigning an item to the cluster whose centroid 
(mean) is nearest. (Distance is usually computed using Euclidean distance with 
either standardized or unstandardized observations.) Recalculate the centroid 
for the cluster receiving the new item and for the cluster losing the item. 


3. Repeat Step 2 until no more reassignments take place.                    (12-16)


Rather than starting with a partition of all items into K preliminary groups 
in Step 1, we could specify K initial centroids (seed points) and then proceed to 
Step 2. 

The final assignment of items to clusters will be, to some extent, dependent 
upon the initial partition or the initial selection of seed points. Experience suggests 
that most major changes in assignment occur with the first reallocation step. 
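
The three steps of (12-16) translate almost line for line into code. The following is a minimal sketch, assuming squared Euclidean distance on unstandardized observations and an initial partition supplied by the user; the function `k_means` and its arguments are illustrative names. The data are the four items analyzed in Example 12.11, starting from the partition (AB), (CD).

```python
def sq_dist(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def centroid(points):
    n = len(points)
    return tuple(sum(x[i] for x in points) / n for i in range(len(points[0])))

def k_means(items, assignment):
    """items: name -> coordinates; assignment: name -> initial cluster label."""
    assignment = dict(assignment)
    changed = True
    while changed:                      # Step 3: repeat until nothing moves
        changed = False
        for name, x in items.items():   # Step 2: go through the list of items
            # Recompute the centroids from the current assignment.
            members = {}
            for other, label in assignment.items():
                members.setdefault(label, []).append(items[other])
            centers = {label: centroid(pts) for label, pts in members.items()}
            nearest = min(centers, key=lambda label: sq_dist(x, centers[label]))
            if sq_dist(x, centers[nearest]) < sq_dist(x, centers[assignment[name]]):
                assignment[name] = nearest
                changed = True
    return assignment

# Step 1: the four items of Example 12.11 with the initial partition (AB), (CD).
items = {"A": (5, 3), "B": (-1, 1), "C": (1, -2), "D": (-3, -2)}
final = k_means(items, {"A": 1, "B": 1, "C": 2, "D": 2})
```

This sketch compares each item with the current centroids only, which is the simplest reading of Step 2. An item that is alone in its cluster sits at its own centroid and is therefore never moved, so no cluster becomes empty.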




Example 12.11 (Clustering using the K-means method) Suppose we measure two
variables $X_1$ and $X_2$ for each of four items A, B, C, and D. The data are given in the
following table:

             Observations
    Item      x1      x2
    A          5       3
    B         -1       1
    C          1      -2
    D         -3      -2


The objective is to divide these items into K = 2 clusters such that the
items within a cluster are closer to one another than they are to the items in
different clusters. To implement the K = 2-means method, we arbitrarily partition
the items into two clusters, such as (AB) and (CD), and compute the
coordinates $(\bar{x}_1, \bar{x}_2)$ of the cluster centroid (mean). Thus, at Step 1, we have

                       Coordinates of centroid
    Cluster    x̄1                       x̄2
    (AB)       (5 + (-1))/2 = 2          (3 + 1)/2 = 2
    (CD)       (1 + (-3))/2 = -1         (-2 + (-2))/2 = -2

At Step 2, we compute the Euclidean distance of each item from the group
centroids and reassign each item to the nearest group. If an item is moved from the
initial configuration, the cluster centroids (means) must be updated before proceeding.
The ith coordinate, $i = 1, 2, \ldots, p$, of the centroid is easily updated using the
formulas:

$$\bar{x}_{i,new} = \frac{n\bar{x}_i + x_{ji}}{n + 1} \quad \text{if the jth item is added to a group}$$

$$\bar{x}_{i,new} = \frac{n\bar{x}_i - x_{ji}}{n - 1} \quad \text{if the jth item is removed from a group}$$

Here n is the number of items in the "old" group with centroid
$\bar{\mathbf{x}}' = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)$.


Consider the initial clusters (AB) and (CD). The coordinates of the centroids are
(2, 2) and (-1, -2) respectively. Suppose item A with coordinates (5, 3) is moved to
the (CD) group. The new groups are (B) and (ACD) with updated centroids:

Group (B):
$$\bar{x}_{1,new} = \frac{2(2) - 5}{2 - 1} = -1 \qquad \bar{x}_{2,new} = \frac{2(2) - 3}{2 - 1} = 1, \quad \text{the coordinates of B}$$

Group (ACD):
$$\bar{x}_{1,new} = \frac{2(-1) + 5}{2 + 1} = 1 \qquad \bar{x}_{2,new} = \frac{2(-2) + 3}{2 + 1} = -.33$$
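
The two updating formulas are easy to express as functions. The sketch below uses the illustrative names `add_item` and `remove_item` and reproduces the move of item A from (AB) to (CD) worked out above.

```python
def add_item(centroid, n, x):
    """Centroid after adding item x to a group of n items."""
    return tuple((n * c + xi) / (n + 1) for c, xi in zip(centroid, x))

def remove_item(centroid, n, x):
    """Centroid after removing item x from a group of n > 1 items."""
    return tuple((n * c - xi) / (n - 1) for c, xi in zip(centroid, x))

A = (5, 3)
new_b = remove_item((2, 2), 2, A)      # group (AB) loses A, leaving (B)
new_acd = add_item((-1, -2), 2, A)     # group (CD) gains A, giving (ACD)
```

Updating in this way avoids recomputing a mean over all the members of a group each time a single item moves.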




Returning to the initial groupings in Step 1, we compute the squared distances

If A is not moved:
$$d^2(A, (AB)) = (5 - 2)^2 + (3 - 2)^2 = 10$$
$$d^2(A, (CD)) = (5 + 1)^2 + (3 + 2)^2 = 61$$

If A is moved to the (CD) group:
$$d^2(A, (B)) = (5 + 1)^2 + (3 - 1)^2 = 40$$
$$d^2(A, (ACD)) = (5 - 1)^2 + (3 + .33)^2 = 27.09$$

Since A is closer to the center of (AB) than it is to the center of (ACD), it is not
reassigned.

Continuing, we consider reassigning B. We get

If B is not moved:
$$d^2(B, (AB)) = (-1 - 2)^2 + (1 - 2)^2 = 10$$
$$d^2(B, (CD)) = (-1 + 1)^2 + (1 + 2)^2 = 9$$

If B is moved to the (CD) group:
$$d^2(B, (A)) = (-1 - 5)^2 + (1 - 3)^2 = 40$$
$$d^2(B, (BCD)) = (-1 + 1)^2 + (1 + 1)^2 = 4$$

Since B is closer to the center of (BCD) than it is to the center of (AB), B is
reassigned to the (CD) group. We now have the clusters (A) and (BCD) with centroid
coordinates (5, 3) and (-1, -1) respectively.

We check C for reassignment.

If C is not moved:
$$d^2(C, (A)) = (1 - 5)^2 + (-2 - 3)^2 = 41$$
$$d^2(C, (BCD)) = (1 + 1)^2 + (-2 + 1)^2 = 5$$

If C is moved to the (A) group:
$$d^2(C, (AC)) = (1 - 3)^2 + (-2 - .5)^2 = 10.25$$
$$d^2(C, (BD)) = (1 + 2)^2 + (-2 + .5)^2 = 11.25$$

Since C is closer to the center of the BCD group than it is to the center of the AC
group, C is not moved. Continuing in this way, we find that no more reassignments
take place and the final K = 2 clusters are (A) and (BCD).

For the final clusters, we have

               Squared distances to group centroids
                              Item
    Cluster        A       B       C       D
    A              0      40      41      89
    (BCD)         52       4       5       5

The within cluster sums of squares (sums of squared distances to centroid) are

    Cluster A:      0
    Cluster (BCD):  4 + 5 + 5 = 14


Equivalently, we can determine the K = 2 clusters by using the criterion

$$\min E = \sum d^2_{i,c(i)}$$

where the minimum is over the number of K = 2 clusters and $d^2_{i,c(i)}$ is the squared
distance of case i from the centroid (mean) of the assigned cluster.

In this example, there are seven possibilities for K = 2 clusters:

    A, (BCD)
    B, (ACD)
    C, (ABD)
    D, (ABC)
    (AB), (CD)
    (AC), (BD)
    (AD), (BC)

For the A, (BCD) pair:

    A:      $d^2_{A,c(A)} = 0$
    (BCD):  $d^2_{B,c(B)} + d^2_{C,c(C)} + d^2_{D,c(D)} = 4 + 5 + 5 = 14$

Consequently, $\sum d^2_{i,c(i)} = 0 + 14 = 14$.

For the remaining pairs, you may verify that

    B, (ACD):     $\sum d^2_{i,c(i)} = 48.7$
    C, (ABD):     $\sum d^2_{i,c(i)} = 47.3$
    D, (ABC):     $\sum d^2_{i,c(i)} = 31.3$
    (AB), (CD):   $\sum d^2_{i,c(i)} = 28$
    (AC), (BD):   $\sum d^2_{i,c(i)} = 27$
    (AD), (BC):   $\sum d^2_{i,c(i)} = 51.0$

Since the smallest $\sum d^2_{i,c(i)}$ occurs for the pair of clusters (A) and (BCD), this is the
final partition.
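
With only four items, the criterion can be evaluated for every possible partition into two clusters. The sketch below does this by brute force; the names `within_ss` and `criterion` are illustrative, and exhaustive enumeration of this kind is practical only for very small data sets.

```python
from itertools import combinations

items = {"A": (5, 3), "B": (-1, 1), "C": (1, -2), "D": (-3, -2)}

def within_ss(names):
    """Sum of squared distances from each item to its cluster centroid."""
    pts = [items[n] for n in names]
    center = [sum(x[i] for x in pts) / len(pts) for i in range(2)]
    return sum((x[i] - center[i]) ** 2 for x in pts for i in range(2))

names = sorted(items)
criterion = {}
for size in (1, 2):
    for first in combinations(names, size):
        second = tuple(n for n in names if n not in first)
        key = frozenset([first, second])   # unordered pair of clusters
        criterion[key] = within_ss(first) + within_ss(second)

best = min(criterion, key=criterion.get)
```

The dictionary `criterion` ends up with seven entries, one per partition, and `best` is the partition with the smallest total within-cluster sum of squares.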


To check the stability of the clustering, it is desirable to rerun the algorithm with
a new initial partition. Once clusters are determined, intuitions concerning their
interpretations are aided by rearranging the list of items so that those in the first
cluster appear first, those in the second cluster appear next, and so forth. A table of the
cluster centroids (means) and within-cluster variances also helps to delineate group
differences.


Example 12.12 (K-means clustering of public utilities) Let us return to the problem
of clustering public utilities using the data in Table 12.4. The K-means algorithm for
several choices of K was run. We present a summary of the results for K = 4 and
K = 5. In general, the choice of a particular K is not clear cut and depends upon
subject-matter knowledge, as well as data-based appraisals. (Data-based appraisals
might include choosing K so as to maximize the between-cluster variability relative
to the within-cluster variability. Relevant measures might include $|\mathbf{W}|/|\mathbf{B} + \mathbf{W}|$
[see (6-38)] and $\mathrm{tr}(\mathbf{W}^{-1}\mathbf{B})$.) The summary is as follows:


K = 4

             Number of
    Cluster  firms      Firms

    1        5          Idaho Power Co. (8), Nevada Power Co. (11), Puget
                        Sound Power & Light Co. (16), Virginia Electric &
                        Power Co. (22), Kentucky Utilities Co. (9).

    2        6          Central Louisiana Electric Co. (3), Oklahoma Gas & Electric
                        Co. (14), The Southern Co. (18), Texas Utilities Co. (19),
                        Arizona Public Service (1), Florida Power & Light Co. (6).

    3        5          New England Electric Co. (12), Pacific Gas & Electric
                        Co. (15), San Diego Gas & Electric Co. (17),
                        United Illuminating Co. (21), Hawaiian Electric Co. (7).

    4        6          Consolidated Edison Co. (N.Y.) (5), Boston Edison Co.
                        (2), Madison Gas & Electric Co. (10), Northern States
                        Power Co. (13), Wisconsin Electric Power Co.
                        (20), Commonwealth Edison Co. (4).


Distances between Cluster Centers

             1       2       3       4
    1        0
    2     3.08       0
    3     3.29    3.56       0
    4     3.05    2.84    3.18       0


K = 5

             Number of
    Cluster  firms      Firms

    1        5          Nevada Power Co. (11), Puget Sound Power & Light
                        Co. (16), Idaho Power Co. (8), Virginia Electric & Power Co.
                        (22), Kentucky Utilities Co. (9).

    2        6          Central Louisiana Electric Co. (3), Texas Utilities Co. (19),
                        Oklahoma Gas & Electric Co. (14), The Southern Co.
                        (18), Arizona Public Service (1), Florida Power & Light Co. (6).

    3        5          New England Electric Co. (12), Pacific Gas & Electric
                        Co. (15), San Diego Gas & Electric Co. (17), United
                        Illuminating Co. (21), Hawaiian Electric Co. (7).

    4        2          Consolidated Edison Co. (N.Y.) (5), Boston
                        Edison Co. (2).

    5        4          Commonwealth Edison Co. (4), Madison Gas & Electric Co. (10),
                        Northern States Power Co. (13), Wisconsin Electric Power Co. (20).


Distances between Cluster Centers

             1       2       3       4       5
    1        0
    2     3.08       0
    3     3.29    3.56       0
    4     3.63    3.46    2.63       0
    5     3.18    2.99    3.81    2.89       0


The cluster profiles (K = 5) shown in Figure 12.12 order the eight variables
according to the ratios of their between-cluster variability to their within-cluster
variability. [For univariate F-ratios, see Section 6.4.] We have

$$F_{nuc} = \frac{\text{mean square percent nuclear between clusters}}{\text{mean square percent nuclear within clusters}} = \frac{3.335}{.255} = 13.1$$

so firms within different clusters are widely separated with respect to percent
nuclear, but firms within the same cluster show little percent nuclear variation. Fuel
costs (FUELC) and annual sales (SALES) also seem to be of some importance in
distinguishing the clusters.

Reviewing the firms in the five clusters, it is apparent that the K-means method
gives results generally consistent with the average linkage hierarchical method. (See
Example 12.9.) Firms with common or compatible geographical locations cluster.
Also, the firms in a given cluster seem to be roughly the same in terms of percent
nuclear.
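
A univariate F-ratio of this kind can be computed for any single variable once the clusters are fixed. The sketch below assumes the usual one-way analysis of variance divisors, K - 1 for the between-cluster mean square and N - K for the within-cluster mean square; the function name `f_ratio` and the data are hypothetical.

```python
def f_ratio(groups):
    """Between-cluster mean square divided by within-cluster mean square."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means)) / (k - 1)
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n - k)
    return between / within

# Hypothetical values of one variable in three clusters.
groups = [[1.0, 2.0, 3.0], [6.0, 7.0, 8.0], [11.0, 12.0, 13.0]]
ratio = f_ratio(groups)
```

A large ratio indicates that, for this one variable, the cluster means are far apart relative to the spread within clusters. Since the clusters were formed from the same data, such ratios are descriptive only.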


We must caution, as we have throughout the book, that the importance of 
individual variables in clustering must be judged from a multivariate perspective. 
All of the variables (multivariate observations) determine the cluster means and 
the reassignment of items. In addition, the values of the descriptive statistics 
measuring the importance of individual variables are functions of the number of 
clusters and the final configuration of the clusters. On the other hand, descriptive 
measures can be helpful, after the fact, in assessing the “success” of the clustering 
procedure. 


Final Comments—Nonhierarchical Procedures 


There are strong arguments for not fixing the number of clusters, K, in advance, 
including the following: 


1. If two or more seed points inadvertently lie within a single cluster, their resulting
clusters will be poorly differentiated.

2. The existence of an outlier might produce at least one group with very disperse
items.

3. Even if the population is known to consist of K groups, the sampling method
may be such that data from the rarest group do not appear in the sample. Forcing
the data into K groups would lead to nonsensical clusters.


Figure 12.12 Cluster profiles (K = 5) for public utility data. The variables are
ordered by F-ratio size: percent nuclear, total fuel costs, sales, cost per kW capacity
in place, annual load factor, peak kWh demand growth, rate of return on capital, and
fixed-charge coverage. Each column describes a cluster. The cluster number is printed
at the mean of each variable. Dashes indicate one standard deviation above and below
mean.


In cases in which a single run of the algorithm requires the user to specify K, it 
is always a good idea to rerun the algorithm for several choices. 

Discussions of other nonhierarchical clustering procedures are available in [3], 
[8], and [16]. 


12.5 Clustering Based on Statistical Models 


The popular clustering methods discussed earlier in this chapter, including single
linkage, complete linkage, average linkage, Ward's method and K-means clustering,
are intuitively reasonable procedures, but that is as much as we can say without
having a model to explain how the observations were produced. Major
advances in clustering methods have been made through the introduction of
statistical models that indicate how the collection of (p × 1) measurements $\mathbf{x}_j$, from
the N objects, was generated. The most common model is one where cluster k has
expected proportion $p_k$ of the objects and the corresponding measurements are
generated by a probability density function $f_k(\mathbf{x})$. Then, if there are K clusters, the
observation vector for a single object is modeled as arising from the mixing
distribution

$$f_{Mix}(\mathbf{x}) = \sum_{k=1}^{K} p_k f_k(\mathbf{x})$$

where each $p_k \ge 0$ and $\sum_{k=1}^{K} p_k = 1$. This distribution $f_{Mix}(\mathbf{x})$ is called a mixture of
the K distributions $f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_K(\mathbf{x})$ because the observation is generated
from the component distribution $f_k(\mathbf{x})$ with probability $p_k$. The collection of N
observation vectors generated from this distribution will be a mixture of observations
from the component distributions.
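
The two-stage description of a mixture, first pick a component with probability $p_k$ and then draw from $f_k$, can be simulated directly. The sketch below is restricted to univariate normal components for simplicity, which is an assumption made here rather than part of the model above; the proportions, means, and standard deviations are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mixture_pdf(x, proportions, components):
    """f_Mix(x) = sum over k of p_k f_k(x), for univariate normal components."""
    return sum(p * normal_pdf(x, m, s) for p, (m, s) in zip(proportions, components))

def sample_mixture(n, proportions, components, seed=0):
    """Draw each observation from component k with probability p_k."""
    rng = random.Random(seed)
    labels = rng.choices(range(len(components)), weights=proportions, k=n)
    return [(k, rng.gauss(*components[k])) for k in labels]

# Hypothetical two-component mixture: p = (.3, .7), N(0, 1) and N(5, 1).
proportions = [0.3, 0.7]
components = [(0.0, 1.0), (5.0, 1.0)]
draws = sample_mixture(1000, proportions, components)
```

Each draw carries its component label, which in a clustering problem is exactly the unobserved quantity one would like to recover.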

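The most common mixture model is a mixture of multivariate normal distributions
where the k-th component $f_k(\mathbf{x})$ is the $N_p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ density function.

The normal mixture model for one observation $\mathbf{x}$ is

$$f_{Mix}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_K)
= \sum_{k=1}^{K} p_k \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}_k|^{1/2}}
\exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)' \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right) \qquad (12\text{-}17)$$

Clusters generated by this model are ellipsoidal in shape with the heaviest
concentration of observations near the center.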


704 Chapter 12 Clustering, Distance Methods, and Ordination 


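Inferences are based on the likelihood, which for N objects and a fixed number
of clusters K, is

$$L(p_1, \ldots, p_K, \mu_1, \Sigma_1, \ldots, \mu_K, \Sigma_K) = \prod_{j=1}^{N} f_{Mix}(x_j \mid \mu_1, \Sigma_1, \ldots, \mu_K, \Sigma_K)$$

$$= \prod_{j=1}^{N} \left( \sum_{k=1}^{K} p_k \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_k)' \Sigma_k^{-1} (x_j - \mu_k) \right) \right) \tag{12-18}$$

where the proportions $p_1, \ldots, p_K$, the mean vectors
$\mu_1, \ldots, \mu_K$, and the covariance matrices
$\Sigma_1, \ldots, \Sigma_K$ are unknown. The measurements for different objects
are treated as independent and identically distributed observations from the
mixture distribution.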
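
The mixture density (12-17) and the logarithm of the likelihood (12-18) can be
evaluated directly once values for the proportions, means, and covariance
matrices are supplied. The following is a minimal sketch, assuming NumPy is
available; the function names and the two-component example values are
hypothetical and are not taken from the text.

```python
import numpy as np


def normal_density(x, mu, sigma):
    """N_p(mu, sigma) density evaluated at each row of x."""
    x = np.atleast_2d(x)
    p = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)' sigma^{-1} (x - mu), one value per row
    quad = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
    const = (2.0 * np.pi) ** (p / 2.0) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / const


def mixture_density(x, props, mus, sigmas):
    """f_Mix(x) = sum_k p_k f_k(x), as in (12-17)."""
    return sum(pk * normal_density(x, mu, s)
               for pk, mu, s in zip(props, mus, sigmas))


def log_likelihood(X, props, mus, sigmas):
    """ln L = sum_j ln f_Mix(x_j), the logarithm of (12-18)."""
    return float(np.sum(np.log(mixture_density(X, props, mus, sigmas))))


# Hypothetical two-component mixture in p = 2 dimensions
props = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]
X = np.array([[0.0, 0.0], [3.0, 3.0], [1.5, 1.5]])

print(mixture_density(X, props, mus, sigmas))
print(log_likelihood(X, props, mus, sigmas))
```

Working with the log-likelihood rather than the product in (12-18) avoids
numerical underflow when N is large.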

There are typically far too many unknown parameters for making inferences when
the number of objects to be clustered is at least moderate. However, certain
conclusions can be made regarding situations where a heuristic clustering
method should work well. In particular, the likelihood based procedure under
the normal mixture model with all $\Sigma_k$ the same multiple of the identity
matrix, $\eta I$, is approximately the same as K-means clustering and Ward’s
method. To date, no statistical models have been advanced for which the cluster
formation procedure is approximately the same as single linkage, complete
linkage or average linkage.

Most importantly, under the sequence of mixture models (12-17) for different
K, the problems of choosing the number of clusters and choosing an appropriate
clustering method have been reduced to the problem of selecting an appropriate
statistical model. This is a major advance.

A good approach to selecting a model is to first obtain the maximum likelihood
estimates $\hat{p}_1, \ldots, \hat{p}_K, \hat{\mu}_1, \hat{\Sigma}_1, \ldots, \hat{\mu}_K, \hat{\Sigma}_K$
for a fixed number of clusters K. These estimates must be obtained numerically
using special purpose software. The resulting value of the maximum of the
likelihood

$$L_{max} = L(\hat{p}_1, \ldots, \hat{p}_K, \hat{\mu}_1, \hat{\Sigma}_1, \ldots, \hat{\mu}_K, \hat{\Sigma}_K)$$


provides the basis for model selection. How do we decide on a reasonable value
for the number of clusters K? In order to compare models with different numbers
of parameters, a penalty is subtracted from twice the maximized value of the
log-likelihood to give

$$2 \ln L_{max} - \text{Penalty}$$

where the penalty depends on the number of parameters estimated and the number
of observations N. Since the probabilities $p_k$ sum to 1, there are only K − 1
probabilities that must be estimated, K × p means and K × p(p + 1)/2 variances
and covariances. For the Akaike information criterion (AIC), the penalty is
2N × (number of parameters) so

$$AIC = 2 \ln L_{max} - 2N \left( K \tfrac{1}{2} (p + 1)(p + 2) - 1 \right) \tag{12-19}$$




The Bayesian information criterion (BIC) is similar but uses the logarithm of the
number of observations in the penalty function

$$BIC = 2 \ln L_{max} - 2 \ln(N) \left( K \tfrac{1}{2} (p + 1)(p + 2) - 1 \right) \tag{12-20}$$

There is still occasional difficulty with too many parameters in the mixture
model so simple structures are assumed for the $\Sigma_k$. In particular,
progressively more complicated structures are allowed as indicated in the
following table.

Assumed form for $\Sigma_k$                                            Total number of parameters    BIC
$\Sigma_k = \eta I$                                                    K(p + 1)                      $2 \ln L_{max} - 2 \ln(N) K(p + 1)$
$\Sigma_k = \eta_k I$                                                  K(p + 2) − 1                  $2 \ln L_{max} - 2 \ln(N) (K(p + 2) - 1)$
$\Sigma_k = \eta_k \, Diag(\lambda_1, \lambda_2, \ldots, \lambda_p)$    K(p + 2) + p − 1              $2 \ln L_{max} - 2 \ln(N) (K(p + 2) + p - 1)$


Additional structures for the covariance matrices are considered in [6] and [9]. 
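
The parameter counts that enter the penalties can be checked by adding the
three kinds of parameters separately. The sketch below is illustrative only;
the function names and the string labels for the covariance forms are
hypothetical. It confirms that K − 1 proportions, K × p means, and
K × p(p + 1)/2 variances and covariances total K(p + 1)(p + 2)/2 − 1, and it
reproduces the counts listed for the restricted covariance forms.

```python
def n_params_unrestricted(K, p):
    """Parameters in the K-component normal mixture with unrestricted covariances."""
    proportions = K - 1                  # the p_k sum to 1
    means = K * p
    covariances = K * p * (p + 1) // 2   # distinct entries of K symmetric matrices
    return proportions + means + covariances


def n_params_structured(K, p, form):
    """Parameter counts for the three restricted covariance forms in the table."""
    covariance_params = {
        "eta*I": 1,            # one common scale factor
        "eta_k*I": K,          # one scale factor per cluster
        "eta_k*Diag": K + p,   # K scale factors plus p diagonal elements
    }
    return (K - 1) + K * p + covariance_params[form]


K, p = 3, 4   # for example, three clusters of four-dimensional measurements
print(n_params_unrestricted(K, p), K * (p + 1) * (p + 2) // 2 - 1)
for form in ("eta*I", "eta_k*I", "eta_k*Diag"):
    print(form, n_params_structured(K, p, form))
```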

Even for a fixed number of clusters, the estimation of a mixture model is 
complicated. One current software package, MCLUST, available in the R software 
library, combines hierarchical clustering, the EM algorithm and the BIC criterion to 
develop an appropriate model for clustering. In the ‘E’-step of the EM
algorithm, an (N × K) matrix is created whose jth row contains estimates of the
conditional (on the current parameter estimates) probabilities that observation
$x_j$ belongs to cluster 1, 2, ..., K. So, at convergence, the jth observation
(object) is assigned to the cluster k for which the conditional probability

$$P(k \mid x_j) = \hat{p}_k f(x_j \mid k) \Big/ \sum_{i=1}^{K} \hat{p}_i f(x_j \mid i)$$

of membership is the largest. (See [6] and [9] and the references therein.)
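
The conditional membership probabilities can be computed for all N observations
at once. The following is a minimal sketch of that calculation for normal
components, assuming NumPy; it is not the MCLUST implementation, and the
function name `e_step` and the example parameter values are hypothetical.

```python
import numpy as np


def e_step(X, props, mus, sigmas):
    """Return the (N x K) matrix of conditional probabilities P(k | x_j)."""
    N, p = X.shape
    K = len(props)
    weighted = np.empty((N, K))
    for k in range(K):
        diff = X - mus[k]
        quad = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigmas[k]), diff)
        const = (2.0 * np.pi) ** (p / 2.0) * np.sqrt(np.linalg.det(sigmas[k]))
        weighted[:, k] = props[k] * np.exp(-0.5 * quad) / const   # p_k f(x_j | k)
    # Divide each row by sum_i p_i f(x_j | i)
    return weighted / weighted.sum(axis=1, keepdims=True)


# Hypothetical current parameter estimates for K = 2 clusters in p = 2 dimensions
props = [0.5, 0.5]
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
sigmas = [np.eye(2), np.eye(2)]
X = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]])

post = e_step(X, props, mus, sigmas)
labels = post.argmax(axis=1)   # assign each object to its most probable cluster
print(post)
print(labels)
```

Each row of `post` sums to 1. An observation that lies midway between two
equally weighted components with equal covariance matrices receives probability
one half for each.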


Example 12.13 (A model based clustering of the iris data) Consider the Iris data in
Table 11.5. Using MCLUST and specifically the me function, we first fit the p = 4
dimensional normal mixture model restricting the covariance matrices to satisfy
$\Sigma_k = \eta_k I$, k = 1, 2, 3.

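Using the BIC criterion, the software chooses K = 3 clusters with estimated
centers

$$\hat{\mu}_1 = \begin{bmatrix} 5.01 \\ 3.43 \\ 1.46 \\ 0.25 \end{bmatrix}, \quad \hat{\mu}_2 = \begin{bmatrix} 5.90 \\ 2.75 \\ \text{[illegible]} \\ 1.43 \end{bmatrix}, \quad \hat{\mu}_3 = \begin{bmatrix} 6.85 \\ 3.07 \\ \text{[illegible]} \\ 2.07 \end{bmatrix}$$

and estimated variance-covariance scale factors $\hat{\eta}_1 = .076$,
$\hat{\eta}_2 = .163$ and $\hat{\eta}_3 = .163$. The estimated mixing
proportions are $\hat{p}_1 = .3333$, $\hat{p}_2 = .4133$ and
$\hat{p}_3 = .2534$. For this solution, BIC = −853.8. A matrix plot of the
clusters for pairs of variables is shown in Figure 12.13.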

Once we have an estimated mixture model, a new object $x_j$ will be assigned to the
cluster for which the conditional probability of membership is the largest (see [9]).

Assuming the $\Sigma_k = \eta_k I$ covariance structure and allowing up to K = 7
clusters, the BIC can be increased to BIC = −705.1.


Figure 12.13 Multiple scatter plots of K = 3 clusters for Iris data. [Scatter-plot matrix; panel labels include Sepal.Length, Petal.Length, and Petal.Width.]


Finally, using the BIC criterion with up to K = 9 groups and several different
covariance structures, the best choice is a two group mixture model with
unconstrained covariances. The estimated mixing probabilities are
$\hat{p}_1 = .3333$ and $\hat{p}_2 = .6667$. The estimated group centers are

$$\hat{\mu}_1 = \begin{bmatrix} 5.01 \\ 3.43 \\ 1.46 \\ 0.25 \end{bmatrix}, \quad \hat{\mu}_2 = \begin{bmatrix} 6.26 \\ 2.87 \\ 4.91 \\ 1.68 \end{bmatrix}$$

and the two estimated covariance matrices are

$$\hat{\Sigma}_1 = \begin{bmatrix} .1218 & .0972 & .0160 & .0101 \\ .0972 & .1408 & .0115 & .0091 \\ .0160 & .0115 & .0296 & .0059 \\ .0101 & .0091 & .0059 & .0109 \end{bmatrix}, \quad \hat{\Sigma}_2 = \begin{bmatrix} .4530 & .1209 & .4489 & .1655 \\ .1209 & .1096 & .1414 & .0792 \\ .4489 & .1414 & .6748 & .2858 \\ .1655 & .0792 & .2858 & .1786 \end{bmatrix}$$

Essentially, two species of Iris have been put in the same cluster as the projected
view of the scatter plot of the sepal measurements in Figure 12.14 shows. ■


12.6 Multidimensional Scaling 


This section begins a discussion of methods for displaying (transformed)
multivariate data in low-dimensional space. We have already considered this issue when we


Figure 12.14 Scatter plot of sepal measurements for best model. [Horizontal axis: Sepal.Length.]


discussed plotting scores on, say, the first two principal components or the scores on 
the first two linear discriminants. The methods we are about to discuss differ from 
these procedures in the sense that their primary objective is to “fit” the original data 
into a low-dimensional coordinate system such that any distortion caused by a re- 
duction in dimensionality is minimized. Distortion generally refers to the similari- 
ties or dissimilarities (distances) among the original data points. Although 
Euclidean distance may be used to measure the closeness of points in the final low- 
dimensional configuration, the notion of similarity or dissimilarity depends upon 
the underlying technique for its definition. A low-dimensional plot of the kind we 
are alluding to is called an ordination of the data. 

Multidimensional scaling techniques deal with the following problem: For a set 
of observed similarities (or distances) between every pair of N items, find a repre- 
sentation of the items in few dimensions such that the interitem proximities “nearly 
match” the original similarities (or distances).

It may not be possible to match exactly the ordering of the original similarities 
(distances). Consequently, scaling techniques attempt to find configurations in 
$q \le N - 1$ dimensions such that the match is as close as possible. The numerical
measure of closeness is called the stress. 

It is possible to arrange the N items in a low-dimensional coordinate system using 
only the rank orders of the N(N — 1)/2 original similarities (distances), and not their 
magnitudes. When only this ordinal information is used to obtain a geometric repre- 
sentation, the process is called nonmetric multidimensional scaling. If the actual magni- 
tudes of the original similarities (distances) are used to obtain a geometric 
representation in q dimensions, the process is called metric multidimensional scaling. 
Metric multidimensional scaling is also known as principal coordinate analysis. 




Scaling techniques were developed by Shepard (see [29] for a review of early
work), Kruskal [19, 20, 21], and others. A good summary of the history, theory, and 
applications of multidimensional scaling is contained in [35]. Multidimensional 
scaling invariably requires the use of a computer, and several good computer 
programs are now available for the purpose. 


The Basic Algorithm 


For N items, there are M = N(N − 1)/2 similarities (distances) between pairs of
different items. These similarities constitute the basic data. (In cases where the
similarities cannot be easily quantified as, for example, the similarity between two
colors, the rank orders of the similarities are the basic data.)

Assuming no ties, the similarities can be arranged in a strictly ascending order as

$$s_{i_1 k_1} < s_{i_2 k_2} < \cdots < s_{i_M k_M} \tag{12-21}$$

Here $s_{i_1 k_1}$ is the smallest of the M similarities. The subscript
$i_1 k_1$ indicates the pair of items that are least similar, that is, the items
with rank 1 in the similarity ordering. Other subscripts are interpreted in the
same manner. We want to find a q-dimensional configuration of the N items such
that the distances, $d_{ik}^{(q)}$, between pairs of items match the ordering in
(12-21). If the distances are laid out in a manner corresponding to that
ordering, a perfect match occurs when

$$d_{i_1 k_1}^{(q)} > d_{i_2 k_2}^{(q)} > \cdots > d_{i_M k_M}^{(q)} \tag{12-22}$$

That is, the descending ordering of the distances in q dimensions is exactly
analogous to the ascending ordering of the initial similarities. As long as the
order in (12-22) is preserved, the magnitudes of the distances are unimportant.

For a given value of q, it may not be possible to find a configuration of points
whose pairwise distances are monotonically related to the original similarities.
Kruskal [19] proposed a measure of the extent to which a geometrical
representation falls short of a perfect match. This measure, the stress, is
defined as

$$Stress(q) = \left\{ \frac{\sum\sum_{i<k} \left( d_{ik}^{(q)} - \hat{d}_{ik}^{(q)} \right)^2}{\sum\sum_{i<k} \left[ d_{ik}^{(q)} \right]^2} \right\}^{1/2} \tag{12-23}$$

The $\hat{d}_{ik}^{(q)}$’s in the stress formula are numbers known to satisfy
(12-22); that is, they are monotonically related to the similarities. The
$\hat{d}_{ik}^{(q)}$’s are not distances in the sense that they satisfy the usual
distance properties of (1-25). They are merely reference numbers used to judge
the nonmonotonicity of the observed $d_{ik}^{(q)}$’s.

The idea is to find a representation of the items as points in q-dimensions such
that the stress is as small as possible. Kruskal [19] suggests the stress be
informally interpreted according to the following guidelines:


Stress Goodness of fit 
20% Poor 
10% Fair 
5% Good (12-24) 
2.5% Excellent 
0% Perfect 


Goodness of fit refers to the monotonic relationship between the similarities and the 
final distances.




A second measure of discrepancy, introduced by Takane et al. [31], is becoming
the preferred criterion. For a given dimension q, this measure, denoted by SStress,
replaces the $d_{ik}$’s and $\hat{d}_{ik}$’s in (12-23) by their squares and is given by

$$SStress = \left[ \frac{\sum\sum_{i<k} \left( d_{ik}^2 - \hat{d}_{ik}^2 \right)^2}{\sum\sum_{i<k} d_{ik}^4} \right]^{1/2} \tag{12-25}$$
The value of SStress is always between 0 and 1. Any value less than .1 is typically 
taken to mean that there is a good representation of the objects by the points in the 
given configuration. 
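
Both discrepancy measures are simple functions of the configuration distances
and the reference numbers. The following sketch, assuming NumPy, evaluates
(12-23) and (12-25) for distances supplied as flat arrays over the pairs
i < k; the function names and the two-pair example are hypothetical.

```python
import numpy as np


def stress(d, d_hat):
    """Stress (12-23) from configuration distances d and reference numbers d_hat."""
    d = np.asarray(d, dtype=float)
    d_hat = np.asarray(d_hat, dtype=float)
    return np.sqrt(np.sum((d - d_hat) ** 2) / np.sum(d ** 2))


def sstress(d, d_hat):
    """SStress (12-25): the same comparison made on squared quantities."""
    d = np.asarray(d, dtype=float)
    d_hat = np.asarray(d_hat, dtype=float)
    return np.sqrt(np.sum((d ** 2 - d_hat ** 2) ** 2) / np.sum(d ** 4))


# Hypothetical example with M = 2 pairs of items
d = [3.0, 4.0]
d_hat = [3.0, 3.0]
print(stress(d, d_hat))    # sqrt(1 / 25)
print(sstress(d, d_hat))   # sqrt(49 / 337)
print(stress(d, d))        # a perfect match gives zero
```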

Once items are located in q dimensions, their q × 1 vectors of coordinates can be
treated as multivariate observations. For display purposes, it is convenient to represent 
this q-dimensional scatter plot in terms of its principal component axes. (See Chapter 8.) 

We have written the stress measure as a function of q, the number of dimensions
for the geometrical representation. For each q, the configuration leading to the
minimum stress can be obtained. As q increases, minimum stress will, within
rounding error, decrease and will be zero for q = N − 1. Beginning with q = 1, a
plot of these stress (q) numbers versus q can be constructed. The value of q for
which this plot begins to level off may be selected as the “best” choice of the
dimensionality. That is, we look for an “elbow” in the stress-dimensionality plot.

The entire multidimensional scaling algorithm is summarized in these steps: 

1. For N items, obtain the M = N(N − 1)/2 similarities (distances) between
distinct pairs of items. Order the similarities as in (12-21). (Distances are
ordered from largest to smallest.) If similarities (distances) cannot be
computed, the rank orders must be specified.

2. Using a trial configuration in q dimensions, determine the interitem distances
$d_{ik}^{(q)}$ and numbers $\hat{d}_{ik}^{(q)}$, where the latter satisfy (12-22)
and minimize the stress (12-23) or SStress (12-25). (The $\hat{d}_{ik}^{(q)}$ are
frequently determined within scaling computer programs using regression methods
designed to produce monotonic “fitted” distances.)

3. Using the $\hat{d}_{ik}^{(q)}$’s, move the points around to obtain an improved
configuration. (For q fixed, an improved configuration is determined by a general
function minimization procedure applied to the stress. In this context, the
stress is regarded as a function of the N × q coordinates of the N items.) A new
configuration will have new $d_{ik}^{(q)}$’s, new $\hat{d}_{ik}^{(q)}$’s, and
smaller stress. The process is repeated until the best (minimum stress)
representation is obtained.

4. Plot minimum stress (q) versus q and choose the best number of dimensions, q*,
from an examination of this plot. (12-26)
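
The monotonic “fitted” distances in step 2 can be obtained by least squares
monotone regression. The sketch below uses the pool-adjacent-violators
procedure, which is one common way to do this and is not necessarily what any
particular scaling program uses. It produces weakly monotone reference numbers,
so ties are allowed, which relaxes the strict inequalities in (12-22). The
function names are hypothetical.

```python
import numpy as np


def monotone_fit_nondecreasing(y):
    """Least squares nondecreasing fit to the sequence y (pool adjacent violators)."""
    blocks = []   # each block holds [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Pool the last two blocks while their means violate the ordering
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])


def reference_numbers(d_in_similarity_order):
    """Reference numbers d_hat for distances listed in the order of (12-21).

    A perfect match (12-22) has these distances decreasing, so a nonincreasing
    fit is obtained by reversing, fitting, and reversing back.
    """
    d = np.asarray(d_in_similarity_order, dtype=float)
    return monotone_fit_nondecreasing(d[::-1])[::-1]


# Hypothetical distances for M = 4 pairs, listed from least to most similar
d = [5.0, 3.0, 4.0, 1.0]
print(reference_numbers(d))   # the out-of-order pair (3, 4) is pooled to 3.5
```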


We have assumed that the initial similarity values are symmetric ($s_{ik} = s_{ki}$), that
there are no ties, and that there are no missing observations. Kruskal [19, 20] has
suggested methods for handling asymmetries, ties, and missing observations. In ad- 
dition, there are now multidimensional scaling computer programs that will handle 
not only Euclidean distance, but any distance of the Minkowski type. [See (12-3).] 

The next three examples illustrate multidimensional scaling with distances as 
the initial (dis)similarity measures. 


Example 12.14 (Multidimensional scaling of U.S. cities) Table 12.7 displays the 
airline distances between pairs of selected U.S. cities. 


Table 12.7 Airline-Distance Data

                     (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)   (10)   (11)   (12)
(1)  Atlanta           0
(2)  Boston         1068      0
(3)  Cincinnati      461    867      0
(4)  Columbus        549    769    107      0
(5)  Dallas          805   1819    943   1050      0
(6)  Indianapolis    508    941    108    172    882      0
(7)  Little Rock     505   1494    618    725    325    562      0
(8)  Los Angeles    2197   3052   2186   2245   1403   2080   1701      0
(9)  Memphis         366   1355    502    586    464    436    137   1831      0
(10) St. Louis       558   1178    338    409    645    234    353   1848    294      0
(11) Spokane        2467   2747   2067   2131   1891   1959   1988   1227   2042   1820      0
(12) Tampa           467   1379    928    985   1077    975    912   2480    779   1016   2821      0


Figure 12.15 A geometrical representation of cities produced by multidimensional
scaling.


Since the cities naturally lie in a two-dimensional space (a nearly level part of the 
curved surface of the earth), it is not surprising that multidimensional scaling with 
q = 2 will locate these items about as they occur on a map. Note that if the distances
in the table are ordered from largest to smallest (that is, from least similar to
most similar), the first position is occupied by $d_{Boston, L.A.} = 3052$.

A multidimensional scaling plot for q = 2 dimensions is shown in Figure 12.15.
The axes lie along the sample principal components of the scatter plot.

A plot of stress (q) versus q is shown in Figure 12.16 on page 712. Since
stress (1) × 100% = 12%, a representation of the cities in one dimension (along a
single axis) is not unreasonable. The “elbow” of the stress function occurs at q = 2.
Here stress (2) × 100% = 0.8%, and the “fit” is almost perfect.

The plot in Figure 12.16 indicates that q = 2 is the best choice for the
dimension of the final configuration. Note that the stress actually increases for
q = 3. This anomaly can occur for extremely small values of stress because of
difficulties with the numerical search procedure used to locate the minimum
stress. ■


Example 12.15 (Multidimensional scaling of public utilities) Let us try to represent
the 22 public utility firms discussed in Example 12.7 as points in a low-dimensional
space. The measures of (dis)similarities between pairs of firms are the Euclidean
distances listed in Table 12.6. Multidimensional scaling in q = 1, 2, ..., 6 dimensions
produced the stress function shown in Figure 12.17.


Figure 12.17 Stress function for distances between utilities. [Vertical axis: Stress.]


Figure 12.18 A geometrical representation of utilities produced by multidimensional
scaling.


The stress function in Figure 12.17 has no sharp elbow. The plot appears to level
out at “good” values of stress (less than or equal to 5%) in the neighborhood of
q = 4. A good four-dimensional representation of the utilities is achievable, but
difficult to display. We show a plot of the utility configuration obtained in
q = 2 dimensions in Figure 12.18. The axes lie along the sample principal
components of the final scatter.

Although the stress for two dimensions is rather high (stress (2) × 100% = 19%),
the distances between firms in Figure 12.18 are not wildly inconsistent with
the clustering results presented earlier in this chapter. For example, the midwest 
utilities—Commonwealth Edison, Wisconsin Electric Power (WEPCO), Madison 
Gas and Electric (MG & E), and Northern States Power (NSP)—are close together 
(similar). Texas Utilities and Oklahoma Gas and Electric (Ok. G & E) are also very 
close together (similar). Other utilities tend to group according to geographical 
locations or similar environments. 

The utilities cannot be positioned in two dimensions such that the interutility
distances $d_{ik}^{(2)}$ are entirely consistent with the original distances in
Table 12.6. More flexibility for positioning the points is required, and this can
only be obtained by introducing additional dimensions. ■


Example 12.16 (Multidimensional scaling of universities) Data related to 25 U.S. 
universities are given in Table 12.9 on page 729. (See Example 12.19.) These data 
give the average SAT score of entering freshmen, percent of freshmen in top 


Figure 12.19 A two-dimensional representation of universities produced by metric
multidimensional scaling.


10% of high school class, percent of applicants accepted, student-faculty ratio, esti- 
mated annual expense, and graduation rate (%). A metric multidimensional scaling 
algorithm applied to the standardized university data gives the two-dimensional 
representation shown in Figure 12.19. Notice how the private universities cluster 
on the right of the plot while the large public universities are, generally, on the left. 
A nonmetric multidimensional scaling two-dimensional configuration is shown in 
Figure 12.20. For this example, the metric and nonmetric scaling representations 
are very similar, with the two dimensional stress value being approximately 10% 
for both scalings. ■


Classical metric scaling, or principal coordinate analysis, is equivalent to plotting
the principal components. Different software programs choose the signs of the
appropriate eigenvectors differently, so at first sight, two solutions may appear to
be different. However, the solutions will coincide with a reflection of one or more
of the axes. (See [26].)
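
The equivalence can be checked numerically when the initial distances are
Euclidean distances computed from a data matrix. The following sketch, assuming
NumPy, obtains principal coordinates by double centering the squared distances
and compares them with principal component scores after removing the sign
ambiguity. The function names and the simulated data are hypothetical.

```python
import numpy as np


def classical_scaling(D, q):
    """Principal coordinates in q dimensions from an N x N Euclidean distance matrix."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J             # doubly centered inner products
    eigvals, eigvecs = np.linalg.eigh(B)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:q]   # keep the q largest
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))


def principal_component_scores(X, q):
    """Scores of the centered data on the first q principal components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :q] * s[:q]


# Simulated data (an assumption for illustration): 10 items, 4 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4)) * np.array([4.0, 3.0, 2.0, 1.0])
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

Y_mds = classical_scaling(D, 2)
Y_pca = principal_component_scores(X, 2)

# Reflect axes where the two solutions differ in sign, then compare
signs = np.sign((Y_mds * Y_pca).sum(axis=0))
print(np.allclose(Y_mds * signs, Y_pca, atol=1e-6))
```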


Figure 12.20 A two-dimensional representation of universities produced by nonmetric
multidimensional scaling.


To summarize, the key objective of multidimensional scaling procedures is a 
low-dimensional picture. Whenever multivariate data can be presented graphically 
in two or three dimensions, visual inspection can greatly aid interpretations. 

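When the multivariate observations are naturally numerical, and Euclidean
distances in p-dimensions, $d_{ik}^{(p)}$, can be computed, we can seek a
q < p-dimensional representation by minimizing

$$E = \left[ \sum\sum_{i<k} \left( d_{ik}^{(p)} - d_{ik}^{(q)} \right)^2 \Big/ d_{ik}^{(p)} \right] \left[ \sum\sum_{i<k} d_{ik}^{(p)} \right]^{-1} \tag{12-27}$$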


In this alternative approach, the Euclidean distances in p and q dimensions are 
compared directly. Techniques for obtaining low-dimensional representations by 
minimizing E are called nonlinear mappings. 
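
The criterion E in (12-27) can be evaluated for any candidate configuration.
The sketch below, assuming NumPy, only computes E and does not carry out the
minimization. The function names and the three-point example are hypothetical,
and the items are assumed to be distinct so that no distance in p dimensions is
zero.

```python
import numpy as np


def pairwise_distances(Y):
    """Euclidean distances d_ik for all pairs i < k, as a flat array."""
    diff = Y[:, None, :] - Y[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    i, k = np.triu_indices(Y.shape[0], k=1)
    return D[i, k]


def mapping_error(X, Y):
    """E of (12-27): distances in p dimensions (X) against q dimensions (Y)."""
    dp = pairwise_distances(X)
    dq = pairwise_distances(Y)
    return np.sum((dp - dq) ** 2 / dp) / np.sum(dp)


# Three points in p = 3 dimensions that happen to lie in a plane
X = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [6.0, 0.0, 0.0]])
print(mapping_error(X, X[:, :2]))   # first two coordinates keep every distance
print(mapping_error(X, X[:, :1]))   # one coordinate distorts two of the distances
```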

The final goodness of fit of any low-dimensional representation can be 
depicted graphically by minimal spanning trees. (See [16] for a further discussion of 
these topics.) 




12.7 Correspondence Analysis 


Developed by the French, correspondence analysis is a graphical procedure for rep- 
resenting associations in a table of frequencies or counts. We will concentrate on a 
two-way table of frequencies or contingency table. If the contingency table has I
rows and J columns, the plot produced by correspondence analysis contains two sets
of points: a set of I points corresponding to the rows and a set of J points
corresponding to the columns. The positions of the points reflect associations.

Row points that are close together indicate rows that have similar profiles (con- 
ditional distributions) across the columns. Column points that are close together in- 
dicate columns with similar profiles (conditional distributions) down the rows. 
Finally, row points that are close to column points represent combinations that 
occur more frequently than would be expected from an independence model—that 
is, a model in which the row categories are unrelated to the column categories. 

The usual output from a correspondence analysis includes the “best” two- 
dimensional representation of the data, along with the coordinates of the plotted 
points, and a measure (called the inertia) of the amount of information retained in 
each dimension. 

Before briefly discussing the algebraic development of correspondence analy- 
sis, it is helpful to illustrate the ideas we have introduced with an example. 


Example 12.17 (Correspondence analysis of archaeological data) Table 12.8 contains
the frequencies (counts) of J = 4 different types of pottery (called potsherds)
found at I = 7 archaeological sites in an area of the American Southwest. If we
divide the frequencies in each row (archaeological site) by the corresponding row
total, we obtain a profile of types of pottery. The profiles for the different sites
(rows) are shown in a bar graph in Figure 12.21(a). The widths of the bars are
proportional to the total row frequencies. In general, the profiles are different;
however, the profiles for sites P1 and P2 are similar, as are the profiles for sites
P4 and P5.

The archaeological site profile for different types of pottery (columns) are 
shown in a bar graph in Figure 12.21(b). The site profiles are constructed using the 


Table 12.8 Frequencies of Types of Pottery

                   Type
Site        A      B      C      D    Total
P0         30     10     10     39       89
P1         53      4     16      2       75
P2         73      1     41      1      116
P3         20      6      1      4       31
P4         46     36     37     13      132
P5         45      6     59     10      120
P6         16     28    169      5      218
Total     283     91    333     74      781
Source: Data courtesy of M. J. Tretter.


Figure 12.21 Site and pottery type profiles for the data in Table 12.8. [Panel (a): horizontal axis Site (P0 to P6); panel (b): horizontal axis Type (A to D).]


column totals. The bars in the figure appear to be quite different from one another. 
This suggests that the various types of pottery are not distributed over the archaeo- 
logical sites in the same way. 
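
The profiles described in this example can be computed directly from the
frequencies in Table 12.8. The following sketch assumes NumPy; the variable
names are illustrative.

```python
import numpy as np

sites = ["P0", "P1", "P2", "P3", "P4", "P5", "P6"]
types = ["A", "B", "C", "D"]

# Frequencies from Table 12.8 (rows are sites, columns are pottery types)
X = np.array([
    [30, 10, 10, 39],
    [53, 4, 16, 2],
    [73, 1, 41, 1],
    [20, 6, 1, 4],
    [46, 36, 37, 13],
    [45, 6, 59, 10],
    [16, 28, 169, 5],
])

# Row profiles: each site's frequencies divided by the site total
row_profiles = X / X.sum(axis=1, keepdims=True)
# Column profiles: each type's frequencies divided by the type total
col_profiles = X / X.sum(axis=0, keepdims=True)

for site, profile in zip(sites, row_profiles):
    print(site, np.round(profile, 2))
for j, pottery_type in enumerate(types):
    print(pottery_type, np.round(col_profiles[:, j], 2))
```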

The two-dimensional plot from a correspondence analysis² of the pottery
type-site data is shown in Figure 12.22. 

The plot in Figure 12.22 indicates, for example, that sites P1 and P2 have similar
pottery type profiles (the two points are close together), and sites P0 and P6 have very
different profiles (the points are far apart). The individual points representing the
types of pottery are spread out, indicating that their archaeological site profiles are
quite different. These findings are consistent with the profiles pictured in Figure 12.21.

Notice that the points P0 and D are quite close together and separated from the
remaining points. This indicates that pottery type D tends to be associated, almost
exclusively, with site P0. Similarly, pottery type A tends to be associated with site P1
and, to lesser degrees, with sites P2 and P3. Pottery type B is associated with sites P4
and P5, and pottery type C tends to be associated, again, almost exclusively, with site
P6. Since the archaeological sites represent different periods, these associations are
of considerable interest to archaeologists.

The number $\lambda_1^2 = .28$ at the end of the first coordinate axis in the
two-dimensional plot is the inertia associated with the first dimension. This
inertia is 55% of the total inertia. The inertia associated with the second
dimension is $\lambda_2^2 = .17$, and the second dimension accounts for 33% of the
total inertia. Together, the two dimensions account for 55% + 33% = 88% of the
total inertia. Since, in this case, the
data could be exactly represented in three dimensions, relatively little information 
(variation) is lost by representing the data in the two-dimensional plot of 
Figure 12.22. Equivalently, we may regard this plot as the best two-dimensional rep- 
resentation of the multidimensional scatter of row points and the multidimensional 


² The JMP software was used for a correspondence analysis of the data in Table 12.8.


Figure 12.22 A correspondence analysis plot of the pottery type-site data. [Legend: Type, Site.]


scatter of column points. The combined inertia of 88% suggests that the representa- 
tion “fits” the data well. 

In this example, the graphical output from a correspondence analysis shows the
nature of the associations in the contingency table quite clearly. ■


Algebraic Development of Correspondence Analysis 


To begin, let X, with elements $x_{ij}$, be an I × J two-way table of unscaled
frequencies or counts. In our discussion we take I > J and assume that X is of
full column rank J. The rows and columns of the contingency table X correspond to
different categories of two different characteristics. As an example, the array of
frequencies of different pottery types at different archaeological sites shown in
Table 12.8 is a contingency table with I = 7 archaeological sites and J = 4
pottery types.

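If n is the total of the frequencies in the data matrix X, we first construct a
matrix of proportions $P = \{p_{ij}\}$ by dividing each element of X by n. Hence

$$p_{ij} = \frac{x_{ij}}{n}, \quad i = 1, 2, \ldots, I, \; j = 1, 2, \ldots, J, \quad \text{or} \quad \underset{(I \times J)}{P} = \frac{1}{n} \underset{(I \times J)}{X} \tag{12-28}$$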


The matrix P is called the correspondence matrix. 




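Next define the vectors of row and column sums r and c respectively, and the
diagonal matrices $D_r$ and $D_c$ with the elements of r and c on the diagonals. Thus

$$r_i = \sum_{j=1}^{J} p_{ij} = \sum_{j=1}^{J} \frac{x_{ij}}{n}, \quad i = 1, 2, \ldots, I, \quad \text{or} \quad \underset{(I \times 1)}{r} = \underset{(I \times J)}{P} \, \underset{(J \times 1)}{1_J}$$

$$c_j = \sum_{i=1}^{I} p_{ij} = \sum_{i=1}^{I} \frac{x_{ij}}{n}, \quad j = 1, 2, \ldots, J, \quad \text{or} \quad \underset{(J \times 1)}{c} = \underset{(J \times I)}{P'} \, \underset{(I \times 1)}{1_I} \tag{12-29}$$

where $1_J$ is a J × 1 and $1_I$ is an I × 1 vector of 1’s and

$$D_r = diag(r_1, r_2, \ldots, r_I) \quad \text{and} \quad D_c = diag(c_1, c_2, \ldots, c_J) \tag{12-30}$$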


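We define the square root matrices

$$D_r^{1/2} = diag(\sqrt{r_1}, \ldots, \sqrt{r_I}), \quad D_r^{-1/2} = diag\left( \frac{1}{\sqrt{r_1}}, \ldots, \frac{1}{\sqrt{r_I}} \right)$$

$$D_c^{1/2} = diag(\sqrt{c_1}, \ldots, \sqrt{c_J}), \quad D_c^{-1/2} = diag\left( \frac{1}{\sqrt{c_1}}, \ldots, \frac{1}{\sqrt{c_J}} \right) \tag{12-31}$$

for scaling purposes.
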
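Correspondence analysis can be formulated as the weighted least squares problem
to select $\hat{P} = \{\hat{p}_{ij}\}$, a matrix of specified reduced rank, to
minimize

$$\sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(p_{ij} - \hat{p}_{ij})^2}{r_i c_j} = tr\left[ \left( D_r^{-1/2} (P - \hat{P}) D_c^{-1/2} \right) \left( D_r^{-1/2} (P - \hat{P}) D_c^{-1/2} \right)' \right] \tag{12-32}$$

since $(p_{ij} - \hat{p}_{ij}) / \sqrt{r_i c_j}$ is the (i, j) element of
$D_r^{-1/2} (P - \hat{P}) D_c^{-1/2}$.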


As Result 12.1 demonstrates, the term $rc'$ is common to the approximation
$\hat{P}$ whatever the I × J correspondence matrix P. The matrix
$\hat{P} = rc'$ can be shown to be the best rank 1 approximation to P.


Result 12.1. The term $rc'$ is common to the approximation $\hat{P}$ whatever the
I × J correspondence matrix P.

The reduced rank s approximation to P, which minimizes the sum of squares
(12-32), is given by

$$P \doteq \sum_{k=1}^{s} \tilde{\lambda}_k (D_r^{1/2} \tilde{u}_k)(D_c^{1/2} \tilde{v}_k)' = rc' + \sum_{k=2}^{s} \tilde{\lambda}_k (D_r^{1/2} \tilde{u}_k)(D_c^{1/2} \tilde{v}_k)'$$

where the $\tilde{\lambda}_k$ are the singular values and the I × 1 vectors
$\tilde{u}_k$ and the J × 1 vectors $\tilde{v}_k$ are the corresponding singular
vectors of the I × J matrix $D_r^{-1/2} P D_c^{-1/2}$. The minimum value of
(12-32) is $\sum_{k=s+1}^{J} \tilde{\lambda}_k^2$.

The reduced rank K > 1 approximation to $P - rc'$ is

$$P - rc' \doteq \sum_{k=1}^{K} \lambda_k (D_r^{1/2} u_k)(D_c^{1/2} v_k)' \tag{12-33}$$

where the $\lambda_k$ are the singular values and the I × 1 vectors $u_k$ and the
J × 1 vectors $v_k$ are the corresponding singular vectors of the I × J matrix
$D_r^{-1/2} (P - rc') D_c^{-1/2}$. Here $\lambda_k = \tilde{\lambda}_{k+1}$,
$u_k = \tilde{u}_{k+1}$, and $v_k = \tilde{v}_{k+1}$ for k = 1, ..., J − 1.


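Proof. We first consider a scaled version $B = D_r^{-1/2} P D_c^{-1/2}$ of the
correspondence matrix P. According to Result 2A.16, the best low rank = s
approximation $\hat{B}$ to $D_r^{-1/2} P D_c^{-1/2}$ is given by the first s terms
in the singular-value decomposition

$$D_r^{-1/2} P D_c^{-1/2} = \sum_{k=1}^{J} \tilde{\lambda}_k \tilde{u}_k \tilde{v}_k' \tag{12-34}$$

where

$$D_r^{-1/2} P D_c^{-1/2} \tilde{v}_k = \tilde{\lambda}_k \tilde{u}_k, \quad \tilde{u}_k' D_r^{-1/2} P D_c^{-1/2} = \tilde{\lambda}_k \tilde{v}_k' \tag{12-35}$$

and

$$\left| (D_r^{-1/2} P D_c^{-1/2})(D_r^{-1/2} P D_c^{-1/2})' - \tilde{\lambda}_k^2 I \right| = 0 \quad \text{for } k = 1, \ldots, J$$

The approximation to P is then given by

$$\hat{P} = D_r^{1/2} \hat{B} D_c^{1/2} = \sum_{k=1}^{s} \tilde{\lambda}_k (D_r^{1/2} \tilde{u}_k)(D_c^{1/2} \tilde{v}_k)'$$

and, by Result 2A.16, the error of approximation is
$\sum_{k=s+1}^{J} \tilde{\lambda}_k^2$.
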
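Whatever the correspondence matrix P, the term $rc'$ always provides a (the
best) rank one approximation. This corresponds to the assumption of independence
of the rows and columns. To see this, let $\tilde{u}_1 = D_r^{1/2} 1_I$ and
$\tilde{v}_1 = D_c^{1/2} 1_J$, where $1_I$ is an I × 1 and $1_J$ a J × 1 vector of
1’s. We verify that (12-35) holds for these choices.

$$\tilde{u}_1' (D_r^{-1/2} P D_c^{-1/2}) = (D_r^{1/2} 1_I)' (D_r^{-1/2} P D_c^{-1/2}) = 1_I' P D_c^{-1/2} = c' D_c^{-1/2} = [\sqrt{c_1}, \ldots, \sqrt{c_J}] = (D_c^{1/2} 1_J)' = \tilde{v}_1'$$

and

$$(D_r^{-1/2} P D_c^{-1/2}) \tilde{v}_1 = (D_r^{-1/2} P D_c^{-1/2})(D_c^{1/2} 1_J) = D_r^{-1/2} P 1_J = D_r^{-1/2} r = \begin{bmatrix} \sqrt{r_1} \\ \vdots \\ \sqrt{r_I} \end{bmatrix} = D_r^{1/2} 1_I = \tilde{u}_1$$

That is,

$$(\tilde{u}_1, \tilde{v}_1) = (D_r^{1/2} 1_I, D_c^{1/2} 1_J) \tag{12-36}$$

are singular vectors associated with singular value $\tilde{\lambda}_1 = 1$. For
any correspondence matrix, P, the common term in every expansion is

$$D_r^{1/2} \tilde{u}_1 \tilde{v}_1' D_c^{1/2} = D_r 1_I 1_J' D_c = rc'$$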




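Therefore, we have established the first approximation and (12-34) can always be expressed as

$$\mathbf{P} = \mathbf{r}\mathbf{c}' + \sum_{k=2}^{J} \tilde{\lambda}_k (\mathbf{D}_r^{1/2}\tilde{\mathbf{u}}_k)(\mathbf{D}_c^{1/2}\tilde{\mathbf{v}}_k)'$$

Because of the common term, the problem can be rephrased in terms of $\mathbf{P} - \mathbf{r}\mathbf{c}'$ and its scaled version $\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2}$. By the orthogonality of the singular vectors of $\mathbf{D}_r^{-1/2}\mathbf{P}\mathbf{D}_c^{-1/2}$, we have $\tilde{\mathbf{u}}_k'(\mathbf{D}_r^{1/2}\mathbf{1}_I) = 0$ and $\tilde{\mathbf{v}}_k'(\mathbf{D}_c^{1/2}\mathbf{1}_J) = 0$, for $k > 1$, so

$$\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2} = \sum_{k=2}^{J} \tilde{\lambda}_k \tilde{\mathbf{u}}_k \tilde{\mathbf{v}}_k'$$

is the singular-value decomposition of $\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2}$ in terms of the singular values and vectors obtained from $\mathbf{D}_r^{-1/2}\mathbf{P}\mathbf{D}_c^{-1/2}$. Converting to singular values and vectors $\lambda_k$, $\mathbf{u}_k$, and $\mathbf{v}_k$ from $\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2}$ only amounts to changing $k$ to $k - 1$, so $\lambda_k = \tilde{\lambda}_{k+1}$, $\mathbf{u}_k = \tilde{\mathbf{u}}_{k+1}$, and $\mathbf{v}_k = \tilde{\mathbf{v}}_{k+1}$ for $k = 1, \ldots, J-1$.

In terms of the singular value decomposition for $\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2}$, the expansion for $\mathbf{P} - \mathbf{r}\mathbf{c}'$ takes the form

$$\mathbf{P} - \mathbf{r}\mathbf{c}' = \sum_{k=1}^{J-1} \lambda_k (\mathbf{D}_r^{1/2}\mathbf{u}_k)(\mathbf{D}_c^{1/2}\mathbf{v}_k)' \tag{12-37}$$

The best rank $K$ approximation to $\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2}$ is given by $\sum_{k=1}^{K} \lambda_k \mathbf{u}_k \mathbf{v}_k'$. Then, the best approximation to $\mathbf{P} - \mathbf{r}\mathbf{c}'$ is

$$\mathbf{P} - \mathbf{r}\mathbf{c}' \doteq \sum_{k=1}^{K} \lambda_k (\mathbf{D}_r^{1/2}\mathbf{u}_k)(\mathbf{D}_c^{1/2}\mathbf{v}_k)' \tag{12-38}$$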

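Remark. Note that the vectors $\mathbf{D}_r^{1/2}\mathbf{u}_k$ and $\mathbf{D}_c^{1/2}\mathbf{v}_k$ in the expansion (12-38) of $\mathbf{P} - \mathbf{r}\mathbf{c}'$ need not have length 1 but satisfy the scaling

$$(\mathbf{D}_r^{1/2}\mathbf{u}_k)'\mathbf{D}_r^{-1}(\mathbf{D}_r^{1/2}\mathbf{u}_k) = \mathbf{u}_k'\mathbf{u}_k = 1$$
$$(\mathbf{D}_c^{1/2}\mathbf{v}_k)'\mathbf{D}_c^{-1}(\mathbf{D}_c^{1/2}\mathbf{v}_k) = \mathbf{v}_k'\mathbf{v}_k = 1$$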


Because of this scaling, the expansions in Result 12.1 have been called a generalized 
singular-value decomposition. 

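Let $\boldsymbol{\Lambda}$, $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_I]$ and $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_J]$ be the matrices of singular values and vectors obtained from $\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2}$. It is usual in correspondence analysis to plot the first two or three columns of $\mathbf{F} = \mathbf{D}_r^{-1}(\mathbf{D}_r^{1/2}\mathbf{U})\boldsymbol{\Lambda}$ and $\mathbf{G} = \mathbf{D}_c^{-1}(\mathbf{D}_c^{1/2}\mathbf{V})\boldsymbol{\Lambda}$, or $\lambda_k\mathbf{D}_r^{-1/2}\mathbf{u}_k$ and $\lambda_k\mathbf{D}_c^{-1/2}\mathbf{v}_k$ for $k = 1, 2$, and maybe 3.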

The joint plot of the coordinates in F and G is called a symmetric map (see 
Greenacre [13]) since the points representing the rows and columns have the same 
normalization, or scaling, along the dimensions of the solution. That is, the geometry 
for the row points is identical to the geometry for the column points. 
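The plotting coordinates for a symmetric map can be computed directly from the singular-value decomposition. The sketch below is one possible way to do this in Python with NumPy; the function name `symmetric_map` and the 4 x 3 table of counts are assumptions made for illustration only.

```python
import numpy as np

def symmetric_map(counts, k=2):
    """Return the first k columns of F and G, plus the singular values and masses.

    F holds the row coordinates lambda_k * D_r^{-1/2} u_k and
    G holds the column coordinates lambda_k * D_c^{-1/2} v_k.
    """
    counts = np.asarray(counts, dtype=float)
    P = counts / counts.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    # Scaled centered matrix D_r^{-1/2} (P - rc') D_c^{-1/2}
    A = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, lam, Vt = np.linalg.svd(A, full_matrices=False)
    F = (U[:, :k] * lam[:k]) / np.sqrt(r)[:, None]
    G = (Vt.T[:, :k] * lam[:k]) / np.sqrt(c)[:, None]
    return F, G, lam[:k], r, c

# Hypothetical 4 x 3 table of counts.
table = [[25, 10, 5],
         [12, 30, 8],
         [6, 14, 35],
         [20, 20, 15]]
F, G, lam, r, c = symmetric_map(table, k=2)
print(F)   # one row of coordinates per table row
print(G)   # one row of coordinates per table column
```

With this scaling, the weighted average of the row coordinates is zero along each dimension, and the weighted sum of squared row coordinates along dimension $k$ equals $\lambda_k^2$; the same holds for the column coordinates, which is why rows and columns share a common normalization in the plot.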




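Example 12.18 (Calculations for correspondence analysis) Consider the 3 × 2 contingency table

          B1     B2    Total
A1        24     12      36
A2        16     48      64
A3        60     40     100
Total    100    100     200

The correspondence matrix is

$$\mathbf{P} = \begin{bmatrix} .12 & .06 \\ .08 & .24 \\ .30 & .20 \end{bmatrix}$$

with marginal totals $\mathbf{c}' = [.5, .5]$ and $\mathbf{r}' = [.18, .32, .50]$. The negative square root matrices are

$$\mathbf{D}_r^{-1/2} = \operatorname{diag}(\sqrt{2}/.6,\ \sqrt{2}/.8,\ \sqrt{2}) \qquad \mathbf{D}_c^{-1/2} = \operatorname{diag}(\sqrt{2},\ \sqrt{2})$$

Then

$$\mathbf{P} - \mathbf{r}\mathbf{c}' = \begin{bmatrix} .12 & .06 \\ .08 & .24 \\ .30 & .20 \end{bmatrix} - \begin{bmatrix} .18 \\ .32 \\ .50 \end{bmatrix} \begin{bmatrix} .5 & .5 \end{bmatrix} = \begin{bmatrix} .03 & -.03 \\ -.08 & .08 \\ .05 & -.05 \end{bmatrix}$$

The scaled version of this matrix is

$$\mathbf{A} = \mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2} = \begin{bmatrix} \frac{\sqrt{2}}{.6} & 0 & 0 \\ 0 & \frac{\sqrt{2}}{.8} & 0 \\ 0 & 0 & \sqrt{2} \end{bmatrix} \begin{bmatrix} .03 & -.03 \\ -.08 & .08 \\ .05 & -.05 \end{bmatrix} \begin{bmatrix} \sqrt{2} & 0 \\ 0 & \sqrt{2} \end{bmatrix} = \begin{bmatrix} 0.1 & -0.1 \\ -0.2 & 0.2 \\ 0.1 & -0.1 \end{bmatrix}$$

Since $I > J$, the square of the singular values and the $\mathbf{v}_k$ are determined from

$$\mathbf{A}'\mathbf{A} = \begin{bmatrix} .1 & -.2 & .1 \\ -.1 & .2 & -.1 \end{bmatrix} \begin{bmatrix} .1 & -.1 \\ -.2 & .2 \\ .1 & -.1 \end{bmatrix} = \begin{bmatrix} .06 & -.06 \\ -.06 & .06 \end{bmatrix}$$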




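It is easily checked that $\lambda_1^2 = .12$, $\lambda_2^2 = 0$, since $J - 1 = 1$, and that

$$\mathbf{v}_1 = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix}$$

Further,

$$\mathbf{A}\mathbf{A}' = \begin{bmatrix} .1 & -.1 \\ -.2 & .2 \\ .1 & -.1 \end{bmatrix} \begin{bmatrix} .1 & -.2 & .1 \\ -.1 & .2 & -.1 \end{bmatrix} = \begin{bmatrix} .02 & -.04 & .02 \\ -.04 & .08 & -.04 \\ .02 & -.04 & .02 \end{bmatrix}$$

A computer calculation confirms that the single nonzero eigenvalue is $\lambda_1^2 = .12$, so that the singular value has absolute value $\lambda_1 = .2\sqrt{3}$ and, as you can easily check,

$$\mathbf{u}_1 = \begin{bmatrix} \frac{1}{\sqrt{6}} \\ -\frac{2}{\sqrt{6}} \\ \frac{1}{\sqrt{6}} \end{bmatrix}$$

The expansion of $\mathbf{P} - \mathbf{r}\mathbf{c}'$ is then the single term

$$\lambda_1(\mathbf{D}_r^{1/2}\mathbf{u}_1)(\mathbf{D}_c^{1/2}\mathbf{v}_1)' = \sqrt{.12} \begin{bmatrix} \frac{.6}{\sqrt{2}} & 0 & 0 \\ 0 & \frac{.8}{\sqrt{2}} & 0 \\ 0 & 0 & \frac{1}{\sqrt{2}} \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{6}} \\ -\frac{2}{\sqrt{6}} \\ \frac{1}{\sqrt{6}} \end{bmatrix} \left( \begin{bmatrix} \frac{1}{\sqrt{2}} & 0 \\ 0 & \frac{1}{\sqrt{2}} \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix} \right)'$$

$$= \sqrt{.12} \begin{bmatrix} \frac{.3}{\sqrt{3}} \\ -\frac{.8}{\sqrt{3}} \\ \frac{.5}{\sqrt{3}} \end{bmatrix} \begin{bmatrix} \frac{1}{2} & -\frac{1}{2} \end{bmatrix} = \begin{bmatrix} .03 & -.03 \\ -.08 & .08 \\ .05 & -.05 \end{bmatrix} \quad \text{check}$$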




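There is only one pair of vectors to plot

$$\lambda_1\mathbf{D}_r^{-1/2}\mathbf{u}_1 = \sqrt{.12} \begin{bmatrix} \frac{\sqrt{2}}{.6} & 0 & 0 \\ 0 & \frac{\sqrt{2}}{.8} & 0 \\ 0 & 0 & \sqrt{2} \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{6}} \\ -\frac{2}{\sqrt{6}} \\ \frac{1}{\sqrt{6}} \end{bmatrix} = \sqrt{.12} \begin{bmatrix} \frac{1}{.6\sqrt{3}} \\ -\frac{1}{.4\sqrt{3}} \\ \frac{1}{\sqrt{3}} \end{bmatrix}$$

and

$$\lambda_1\mathbf{D}_c^{-1/2}\mathbf{v}_1 = \sqrt{.12} \begin{bmatrix} \sqrt{2} & 0 \\ 0 & \sqrt{2} \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix} = \sqrt{.12} \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$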
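The hand calculations in Example 12.18 can be reproduced with a few lines of code. The following sketch uses Python with NumPy (a choice made here, not one from the text). The signs of the singular vectors returned by a numerical routine are arbitrary, so only sign-free quantities are compared with the values above.

```python
import numpy as np

# The 3 x 2 contingency table of Example 12.18.
counts = np.array([[24.0, 12.0],
                   [16.0, 48.0],
                   [60.0, 40.0]])

P = counts / counts.sum()      # correspondence matrix
r = P.sum(axis=1)              # [.18, .32, .50]
c = P.sum(axis=0)              # [.5, .5]

# Scaled centered matrix A = D_r^{-1/2} (P - rc') D_c^{-1/2}
A = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, lam, Vt = np.linalg.svd(A, full_matrices=False)
u1, v1, lam1 = U[:, 0], Vt[0, :], lam[0]

# Single-term expansion of P - rc' and the pair of plotting vectors.
rank1 = lam1 * np.outer(np.sqrt(r) * u1, np.sqrt(c) * v1)
row_coords = lam1 * u1 / np.sqrt(r)
col_coords = lam1 * v1 / np.sqrt(c)

print(A)
print(lam1**2)
print(row_coords, col_coords)
```

Up to a common change of sign, the row coordinates are $(1/3, -1/2, 1/5)$ and the column coordinates are $\sqrt{.12}\,(1, -1)$.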


There is a second way to define contingency analysis. Following Greenacre [13], 
we call the preceding approach the matrix approximation method and the approach 
to follow the profile approximation method. We illustrate the profile approximation 
method using the row profiles; however, an analogous solution results if we were to 
begin with the column profiles. 

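Algebraically, the row profiles are the rows of the matrix $\mathbf{D}_r^{-1}\mathbf{P}$, and contingency analysis can be defined as the approximation of the row profiles by points in a low-dimensional space. Consider approximating the row profiles by the matrix $\mathbf{P}^*$. Using the square-root matrices $\mathbf{D}_r^{1/2}$ and $\mathbf{D}_c^{1/2}$ defined in (12-31), we can write

$$(\mathbf{D}_r^{-1}\mathbf{P} - \mathbf{P}^*)\mathbf{D}_c^{-1/2} = \mathbf{D}_r^{-1/2}(\mathbf{D}_r^{-1/2}\mathbf{P} - \mathbf{D}_r^{1/2}\mathbf{P}^*)\mathbf{D}_c^{-1/2}$$

and the least squares criterion (12-32) can be written, with $p_{ij}^* = \hat{p}_{ij}/r_i$, as

$$\sum_i \sum_j \frac{(p_{ij} - \hat{p}_{ij})^2}{r_i c_j} = \sum_i r_i \sum_j \frac{(p_{ij}/r_i - p_{ij}^*)^2}{c_j}$$

$$= \operatorname{tr}\left[\mathbf{D}_r^{1/2}\mathbf{D}_r^{1/2}(\mathbf{D}_r^{-1}\mathbf{P} - \mathbf{P}^*)\mathbf{D}_c^{-1/2}\mathbf{D}_c^{-1/2}(\mathbf{D}_r^{-1}\mathbf{P} - \mathbf{P}^*)'\right]$$

$$= \operatorname{tr}\left[\mathbf{D}_r^{1/2}(\mathbf{D}_r^{-1/2}\mathbf{P} - \mathbf{D}_r^{1/2}\mathbf{P}^*)\mathbf{D}_c^{-1/2}\mathbf{D}_c^{-1/2}(\mathbf{D}_r^{-1/2}\mathbf{P} - \mathbf{D}_r^{1/2}\mathbf{P}^*)'\mathbf{D}_r^{-1/2}\right]$$

$$= \operatorname{tr}\left\{\left[(\mathbf{D}_r^{-1/2}\mathbf{P} - \mathbf{D}_r^{1/2}\mathbf{P}^*)\mathbf{D}_c^{-1/2}\right]\left[(\mathbf{D}_r^{-1/2}\mathbf{P} - \mathbf{D}_r^{1/2}\mathbf{P}^*)\mathbf{D}_c^{-1/2}\right]'\right\} \tag{12-39}$$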

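Minimizing the last expression for the trace in (12-39) is precisely the first minimization problem treated in the proof of Result 12.1. By (12-34), $\mathbf{D}_r^{-1/2}\mathbf{P}\mathbf{D}_c^{-1/2}$ has the singular-value decomposition

$$\mathbf{D}_r^{-1/2}\mathbf{P}\mathbf{D}_c^{-1/2} = \sum_{k=1}^{J} \tilde{\lambda}_k \tilde{\mathbf{u}}_k \tilde{\mathbf{v}}_k' \tag{12-40}$$

The best rank $K$ approximation is obtained by using the first $K$ terms of this expansion. Since, by (12-39), we have $\mathbf{D}_r^{-1/2}\mathbf{P}\mathbf{D}_c^{-1/2}$ approximated by $\mathbf{D}_r^{1/2}\mathbf{P}^*\mathbf{D}_c^{-1/2}$, we left multiply by $\mathbf{D}_r^{-1/2}$ and right multiply by $\mathbf{D}_c^{1/2}$ to obtain the generalized singular-value decomposition

$$\mathbf{D}_r^{-1}\mathbf{P} = \sum_{k=1}^{J} \tilde{\lambda}_k \mathbf{D}_r^{-1/2}\tilde{\mathbf{u}}_k(\mathbf{D}_c^{1/2}\tilde{\mathbf{v}}_k)' \tag{12-41}$$

where, from (12-36), $(\tilde{\mathbf{u}}_1, \tilde{\mathbf{v}}_1) = (\mathbf{D}_r^{1/2}\mathbf{1}_I, \mathbf{D}_c^{1/2}\mathbf{1}_J)$ are singular vectors associated with singular value $\tilde{\lambda}_1 = 1$. Since $\mathbf{D}_r^{-1/2}(\mathbf{D}_r^{1/2}\mathbf{1}_I) = \mathbf{1}_I$ and $(\mathbf{D}_c^{1/2}\mathbf{1}_J)'\mathbf{D}_c^{1/2} = \mathbf{c}'$, the leading term in the decomposition (12-41) is $\mathbf{1}_I\mathbf{c}'$.

Consequently, in terms of the singular values and vectors from $\mathbf{D}_r^{-1/2}\mathbf{P}\mathbf{D}_c^{-1/2}$, the reduced rank $K < J$ approximation to the row profiles $\mathbf{D}_r^{-1}\mathbf{P}$ is

$$\mathbf{P}^* \doteq \mathbf{1}_I\mathbf{c}' + \sum_{k=2}^{K} \tilde{\lambda}_k \mathbf{D}_r^{-1/2}\tilde{\mathbf{u}}_k(\mathbf{D}_c^{1/2}\tilde{\mathbf{v}}_k)' \tag{12-42}$$

In terms of the singular values and vectors $\lambda_k$, $\mathbf{u}_k$ and $\mathbf{v}_k$ obtained from $\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2}$, we can write

$$\mathbf{P}^* - \mathbf{1}_I\mathbf{c}' \doteq \sum_{k=1}^{K-1} \lambda_k \mathbf{D}_r^{-1/2}\mathbf{u}_k(\mathbf{D}_c^{1/2}\mathbf{v}_k)'$$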
A 


(Row profiles for the archaeological data in Table 12.8 are shown in Figure 12.21 on 
page 717.) 
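The profile approximation can also be illustrated numerically. In the sketch below (Python with NumPy; the helper name `row_profile_approx` and the table of counts are hypothetical), the rank 1 approximation returns the same average profile $\mathbf{c}'$ in every row, and using all $J$ terms recovers the row profiles exactly.

```python
import numpy as np

def row_profile_approx(counts, K):
    """Sum of the first K terms of the generalized SVD of the row profiles D_r^{-1} P."""
    counts = np.asarray(counts, dtype=float)
    P = counts / counts.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    B = P / np.sqrt(np.outer(r, c))              # D_r^{-1/2} P D_c^{-1/2}
    U, lam, Vt = np.linalg.svd(B, full_matrices=False)
    left = U[:, :K] / np.sqrt(r)[:, None]        # columns D_r^{-1/2} u_k
    right = Vt[:K, :] * np.sqrt(c)[None, :]      # rows (D_c^{1/2} v_k)'
    return (left * lam[:K]) @ right

# Hypothetical 3 x 3 table of counts.
counts = np.array([[20.0, 10.0, 5.0],
                   [8.0, 30.0, 12.0],
                   [6.0, 9.0, 40.0]])
P = counts / counts.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
profiles = P / r[:, None]                        # rows of D_r^{-1} P

rank1 = row_profile_approx(counts, 1)            # leading term 1_I c'
rank2 = row_profile_approx(counts, 2)
full = row_profile_approx(counts, 3)
print(rank1)
print(np.abs(profiles - rank2).max())            # size of the rank 2 approximation error
```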


Inertia 


Total inertia is a measure of the variation in the count data and is defined as the weighted sum of squares

$$\operatorname{tr}\left[\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2}\left(\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2}\right)'\right] = \sum_i \sum_j \frac{(p_{ij} - r_i c_j)^2}{r_i c_j} = \sum_{k=1}^{J-1} \lambda_k^2 \tag{12-43}$$

where the $\lambda_k$ are the singular values obtained from the singular-value decomposition of $\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2}$ (see the proof of Result 12.1).³

The inertia associated with the best reduced rank $K < J$ approximation to the centered matrix $\mathbf{P} - \mathbf{r}\mathbf{c}'$ (the $K$-dimensional solution) is $\sum_{k=1}^{K} \lambda_k^2$. The residual inertia (variation) not accounted for by the rank $K$ solution is equal to the sum of squares of the remaining singular values: $\lambda_{K+1}^2 + \lambda_{K+2}^2 + \cdots + \lambda_{J-1}^2$. For plots, the inertia associated with dimension $k$, $\lambda_k^2$, is ordinarily displayed along the $k$th coordinate axis, as in Figure 12.22 for $k = 1, 2$.


³Total inertia is related to the chi-square measure of association in a two-way contingency table, $\chi^2 = \sum_{i,j} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$. Here $O_{ij} = x_{ij}$ is the observed frequency and $E_{ij}$ is the expected frequency for the $ij$th cell. In our context, if the row variable is independent of (unrelated to) the column variable, $E_{ij} = n r_i c_j$, and

$$\text{Total inertia} = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(p_{ij} - r_i c_j)^2}{r_i c_j} = \frac{\chi^2}{n}$$
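As a check on the relation between total inertia and the chi-square statistic, the following sketch (Python with NumPy, an illustrative choice) uses the 3 × 2 table of Example 12.18, for which $n = 200$ and $\lambda_1^2 = .12$.

```python
import numpy as np

# Contingency table from Example 12.18.
counts = np.array([[24.0, 12.0],
                   [16.0, 48.0],
                   [60.0, 40.0]])
n = counts.sum()

# Chi-square statistic with expected counts under independence.
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
chi2 = ((counts - expected) ** 2 / expected).sum()

# Total inertia from the correspondence matrix.
P = counts / n
r, c = P.sum(axis=1), P.sum(axis=0)
rc = np.outer(r, c)
total_inertia = ((P - rc) ** 2 / rc).sum()

# Squared singular values of the scaled centered matrix.
lam = np.linalg.svd((P - rc) / np.sqrt(rc), compute_uv=False)

print(chi2, total_inertia, (lam ** 2).sum())
```

For this table the chi-square statistic is 24, and dividing by $n = 200$ gives the total inertia .12, which is also the sum of the squared singular values.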




Interpretation in Two Dimensions 
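Since the inertia is a measure of the data table's total variation, how do we interpret a large value for the proportion $(\lambda_1^2 + \lambda_2^2) \big/ \sum_{k=1}^{J-1} \lambda_k^2$? Geometrically, we say that the associations in the centered data are well represented by points in a plane, and this best approximating plane accounts for nearly all the variation in the data beyond that accounted for by the rank 1 solution (independence model). Algebraically, we say that the approximation (12-38) with $K = 2$,

$$\mathbf{P} - \mathbf{r}\mathbf{c}' \doteq \lambda_1(\mathbf{D}_r^{1/2}\mathbf{u}_1)(\mathbf{D}_c^{1/2}\mathbf{v}_1)' + \lambda_2(\mathbf{D}_r^{1/2}\mathbf{u}_2)(\mathbf{D}_c^{1/2}\mathbf{v}_2)'$$

is very good or, equivalently, that

$$\mathbf{P} \doteq \mathbf{r}\mathbf{c}' + \lambda_1(\mathbf{D}_r^{1/2}\mathbf{u}_1)(\mathbf{D}_c^{1/2}\mathbf{v}_1)' + \lambda_2(\mathbf{D}_r^{1/2}\mathbf{u}_2)(\mathbf{D}_c^{1/2}\mathbf{v}_2)'$$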
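The proportion of inertia carried by each dimension is simple to compute. The following is a minimal sketch in Python with NumPy; the function name `inertia_proportions` and the 4 × 4 table of counts are hypothetical.

```python
import numpy as np

def inertia_proportions(counts):
    """Share of the total inertia carried by each dimension, largest first."""
    counts = np.asarray(counts, dtype=float)
    P = counts / counts.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    A = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    lam_sq = np.linalg.svd(A, compute_uv=False) ** 2
    return lam_sq / lam_sq.sum()

# Hypothetical 4 x 4 table of counts.
counts = np.array([[30, 10, 5, 5],
                   [10, 25, 10, 5],
                   [5, 10, 30, 10],
                   [5, 5, 10, 25]])
props = inertia_proportions(counts)
two_dim_share = props[:2].sum()
print(props)
print(two_dim_share)
```

A value of `two_dim_share` close to 1 indicates that a two-dimensional plot loses little of the association in the table. The last proportion is zero up to rounding because the scaled centered matrix has rank at most $J - 1$.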


Final Comments 


Correspondence analysis is primarily a graphical technique designed to represent 
associations in a low-dimensional space. It can be regarded as a scaling method, and 
can be viewed as a complement to other methods such as multidimensional scaling 
(Section 12.6) and biplots (Section 12.8). Correspondence analysis also has links to 
principal component analysis (Chapter 8) and canonical correlation analysis 
(Chapter 10). The book by Greenacre [14] is one choice for learning more about 
correspondence analysis. 


12.8 Biplots for Viewing Sampling Units and Variables 


A biplot is a graphical representation of the information in an n x p data matrix. 
The bi- refers to the two kinds of information contained in a data matrix. The information in the rows pertains to samples or sampling units and that in the columns pertains to variables.

When there are only two variables, scatter plots can represent the information 
on both the sampling units and the variables in a single diagram. This permits the vi- 
sual inspection of the position of one sampling unit relative to another and the rela- 
tive importance of each of the two variables to the position of any unit. 

With several variables, one can construct a matrix array of scatter plots, 
but there is no one single plot of the sampling units. On the other hand, a two- 
dimensional plot of the sampling units can be obtained by graphing the first two 
principal components, as in Section 8.4. The idea behind biplots is to add the infor- 
mation about the variables to the principal component graph. 

Figure 12.23 gives an example of a biplot for the public utilities data in 
Table 12.4. 

You can see how the companies group together and which variables contribute to their positioning within this representation. For instance, $X_4$ = annual load factor and $X_8$ = total fuel costs are primarily responsible for the grouping of the mostly coastal companies in the lower right. The two variables $X_1$ = fixed-charge ratio and $X_2$ = rate of return on capital put the Florida and Louisiana companies together.

Figure 12.23 A biplot of the data on public utilities.


Constructing Biplots 


The construction of a biplot proceeds from the sample principal components. 

According to Result 8A.1, the best two-dimensional approximation to the data matrix $\mathbf{X}$ approximates the $j$th observation $\mathbf{x}_j$ in terms of the sample values of the first two principal components. In particular,

$$\mathbf{x}_j \doteq \bar{\mathbf{x}} + \hat{y}_{j1}\hat{\mathbf{e}}_1 + \hat{y}_{j2}\hat{\mathbf{e}}_2 \tag{12-44}$$

where $\hat{\mathbf{e}}_1$ and $\hat{\mathbf{e}}_2$ are the first two eigenvectors of $\mathbf{S}$ or, equivalently, of $\mathbf{X}_c'\mathbf{X}_c = (n - 1)\mathbf{S}$. Here $\mathbf{X}_c$ denotes the mean corrected data matrix with rows $(\mathbf{x}_j - \bar{\mathbf{x}})'$. The eigenvectors determine a plane, and the coordinates of the $j$th unit (row) are the pair of values of the principal components, $(\hat{y}_{j1}, \hat{y}_{j2})$.

To include the information on the variables in this plot, we consider the pair of eigenvectors $(\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2)$. These eigenvectors are the coefficient vectors for the first two sample principal components. Consequently, each row of the matrix $\hat{\mathbf{E}} = [\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2]$ positions a variable in the graph, and the magnitudes of the coefficients (the coordinates of the variable) show the weightings that variable has in each principal component. The positions of the variables in the plot are indicated by a vector. Usually, statistical computer programs include a multiplier so that the lengths of all of the vectors can be suitably adjusted and plotted on the same axes as the sampling units. Units that are close to a variable likely have high values on that variable. To interpret a new point $\mathbf{x}_0$, we plot its principal components $\hat{\mathbf{E}}'(\mathbf{x}_0 - \bar{\mathbf{x}})$.

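A direct approach to obtaining a biplot starts from the singular value decomposition (see Result 2A.15), which first expresses the $n \times p$ mean corrected matrix $\mathbf{X}_c$ as

$$\underset{(n \times p)}{\mathbf{X}_c} = \underset{(n \times p)}{\mathbf{U}}\ \underset{(p \times p)}{\boldsymbol{\Lambda}}\ \underset{(p \times p)}{\mathbf{V}'} \tag{12-45}$$

where $\boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$ and $\mathbf{V}$ is an orthogonal matrix whose columns are the eigenvectors of $\mathbf{X}_c'\mathbf{X}_c = (n - 1)\mathbf{S}$. That is, $\mathbf{V} = \hat{\mathbf{E}} = [\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_p]$. Multiplying (12-45) on the right by $\hat{\mathbf{E}}$, we find

$$\mathbf{X}_c\hat{\mathbf{E}} = \mathbf{U}\boldsymbol{\Lambda} \tag{12-46}$$

where the $j$th row of the left-hand side,

$$[(\mathbf{x}_j - \bar{\mathbf{x}})'\hat{\mathbf{e}}_1, (\mathbf{x}_j - \bar{\mathbf{x}})'\hat{\mathbf{e}}_2, \ldots, (\mathbf{x}_j - \bar{\mathbf{x}})'\hat{\mathbf{e}}_p] = [\hat{y}_{j1}, \hat{y}_{j2}, \ldots, \hat{y}_{jp}]$$

is just the value of the principal components for the $j$th item. That is, $\mathbf{U}\boldsymbol{\Lambda}$ contains all of the values of the principal components, while $\mathbf{V} = \hat{\mathbf{E}}$ contains the coefficients that define the principal components.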

The best rank 2 approximation to $\mathbf{X}_c$ is obtained by replacing $\boldsymbol{\Lambda}$ by $\boldsymbol{\Lambda}^* = \operatorname{diag}(\lambda_1, \lambda_2, 0, \ldots, 0)$. This result, called the Eckart-Young theorem, was established in Result 8A.1. The approximation is then

$$\mathbf{X}_c \doteq \mathbf{U}\boldsymbol{\Lambda}^*\mathbf{V}' = [\hat{\mathbf{y}}_1, \hat{\mathbf{y}}_2] \begin{bmatrix} \hat{\mathbf{e}}_1' \\ \hat{\mathbf{e}}_2' \end{bmatrix} \tag{12-47}$$

where $\hat{\mathbf{y}}_1$ is the $n \times 1$ vector of values of the first principal component and $\hat{\mathbf{y}}_2$ is the $n \times 1$ vector of values of the second principal component.

In the biplot, each row of the data matrix, or item, is represented by the point located by the pair of values of the principal components. The $i$th column of the data matrix, or variable, is represented as an arrow from the origin to the point with coordinates $(e_{1i}, e_{2i})$, the entries in the $i$th column of the second matrix $[\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2]'$ in the approximation (12-47). This scale may not be compatible with that of the principal components, so an arbitrary multiplier can be introduced that adjusts all of the vectors by the same amount.
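The construction above translates almost line by line into code. The following sketch uses Python with NumPy on simulated data (the data matrix and the random seed are assumptions, not data from the text); it computes the points and arrows of a biplot and the rank 2 approximation in (12-47).

```python
import numpy as np

# Hypothetical data matrix: n = 10 units measured on p = 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4)) @ rng.normal(size=(4, 4))
n, p = X.shape

Xc = X - X.mean(axis=0)                      # mean corrected data matrix
U, lam, Vt = np.linalg.svd(Xc, full_matrices=False)
E = Vt.T                                     # columns are eigenvectors of S

scores = Xc @ E[:, :2]                       # points: first two principal component values
arrows = E[:, :2]                            # row i holds (e_1i, e_2i) for variable i
Xc_rank2 = scores @ arrows.T                 # rank 2 approximation to Xc

multiplier = np.abs(scores).max() / np.abs(arrows).max()   # one arbitrary rescaling choice
print(scores[:3])
print(arrows * multiplier)
```

The `multiplier` shown is only one convenient way to bring the arrows onto the scale of the points; the text leaves this choice arbitrary. The squared error of the rank 2 approximation equals the sum of the squared singular values that were set to zero.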

The idea of a biplot, to represent both units and variables in the same plot, ex- 
tends to canonical correlation analysis, multidimensional scaling, and even more 
complicated nonlinear techniques. (See [12].) 




Example 12.19 (A biplot of universities and their characteristics) Table 12.9 gives the data on some universities for certain variables used to compare or rank major universities. These variables include $X_1$ = average SAT score of new freshmen, $X_2$ = percentage of new freshmen in top 10% of high school class, $X_3$ = percentage of applicants accepted, $X_4$ = student-faculty ratio, $X_5$ = estimated annual expenses, and $X_6$ = graduation rate (%).

Because two of the variables, SAT and Expenses, are on a much different scale 
from that of the other variables, we standardize the data and base our biplot on the 
matrix of standardized observations z;. The biplot is given in Figure 12.24 on 
page 730. 

Notice how Cal Tech and Johns Hopkins are off by themselves; the variable 
Expense is mostly responsible for this positioning. The large state universities in our 
sample are to the left in the biplot, and most of the private schools are on the right. 


Table 12.9 Data on Universities

University        SAT    Top10  Accept  SFRatio  Expenses  Grad
Harvard          14.00    91     14       11      39.525    97
Princeton        13.75    91     14        8      30.220    95
Yale             13.75    95     19       11      43.514    96
Stanford         13.60    90     20       12      36.450    93
MIT              13.80    94     30       10      34.870    91
Duke             13.15    90     30       12      31.585    95
CalTech          14.15   100     25        6      63.575    81
Dartmouth        13.40    89     23       10      32.162    95
Brown            13.10    89     22       13      22.704    94
JohnsHopkins     13.05    75     44        7      58.691    87
UChicago         12.90    75     50       13      38.380    87
UPenn            12.85    80     36       11      27.553    90
Cornell          12.80    83     33       13      21.864    90
Northwestern     12.60    85     39       11      28.052    89
Columbia         13.10    76     24       12      31.510    88
NotreDame        12.55    81     42       13      15.122    94
UVirginia        12.25    77     44       14      13.349    92
Georgetown       12.55    74     24       12      20.126    92
CarnegieMellon   12.60    62     59        9      25.026    72
UMichigan        11.80    65     68       16      15.470    85
UCBerkeley       12.40    95     40       17      15.140    78
UWisconsin       10.85    40     69       15      11.857    71
PennState        10.81    38     54       18      10.185    80
Purdue           10.05    28     90       19       9.066    69
TexasA&M         10.75    49     67       25       8.704    67

Source: U.S. News & World Report, September 18, 1995, p. 126.




Figure 12.24 A biplot of the data on universities.


Large values for the variables SAT, Top10, and Grad are associated with the private school group. Northwestern lies in the middle of the biplot. ■


A newer version of the biplot, due to Gower and Hand [12], has some advan- 
tages. Their biplot, developed as an extension of the scatter plot, has features that 
make it easier to interpret. 


• The two axes for the principal components are suppressed.
• An axis is constructed for each variable and a scale is attached.


As in the original biplot, the $i$-th item is located by the corresponding pair of values of the first two principal components

$$(\hat{y}_{1i}, \hat{y}_{2i}) = ((\mathbf{x}_i - \bar{\mathbf{x}})'\hat{\mathbf{e}}_1, (\mathbf{x}_i - \bar{\mathbf{x}})'\hat{\mathbf{e}}_2)$$

where $\hat{\mathbf{e}}_1$ and $\hat{\mathbf{e}}_2$ are the first two eigenvectors of $\mathbf{S}$. The scales for the principal components are not shown on the graph.

In addition, the arrows for the variables in the original biplot are replaced by axes that extend in both directions and that have scales attached. As was the case with the arrows, the axis for the $i$-th variable is determined by the $i$-th row of $\hat{\mathbf{E}} = [\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2]$.




To begin, we let $\mathbf{u}_i$ be the vector with 1 in the $i$-th position and 0's elsewhere. Then an arbitrary $p \times 1$ vector $\mathbf{x}$ can be expressed as

$$\mathbf{x} = \sum_{i=1}^{p} x_i \mathbf{u}_i$$

and, by Definition 2A.12, its projection onto the space of the first two eigenvectors has coefficient vector

$$\hat{\mathbf{E}}'\mathbf{x} = \sum_{i=1}^{p} x_i(\hat{\mathbf{E}}'\mathbf{u}_i)$$

so the contribution of the $i$-th variable to the vector sum is $x_i(\hat{\mathbf{E}}'\mathbf{u}_i) = x_i[e_{1i}, e_{2i}]'$. The two entries $e_{1i}$ and $e_{2i}$ in the $i$-th row of $\hat{\mathbf{E}}$ determine the direction of the axis for the $i$-th variable.

The projection vector of the sample mean $\bar{\mathbf{x}} = \sum_{i=1}^{p} \bar{x}_i \mathbf{u}_i$,

$$\hat{\mathbf{E}}'\bar{\mathbf{x}} = \sum_{i=1}^{p} \bar{x}_i(\hat{\mathbf{E}}'\mathbf{u}_i)$$

is the origin of the biplot. Every $\mathbf{x}$ can also be written as $\mathbf{x} = \bar{\mathbf{x}} + (\mathbf{x} - \bar{\mathbf{x}})$ and its projection vector has two components

$$\sum_{i=1}^{p} \bar{x}_i(\hat{\mathbf{E}}'\mathbf{u}_i) + \sum_{i=1}^{p} (x_i - \bar{x}_i)(\hat{\mathbf{E}}'\mathbf{u}_i)$$

Starting from the origin, the points in the direction $w[e_{1i}, e_{2i}]'$ are plotted for $w = 0, \pm 1, \pm 2, \ldots$ This provides a scale for the mean centered variable $x_i - \bar{x}_i$. It defines the distance in the biplot for a change of one unit in $x_i$. But, the origin for the $i$-th variable corresponds to $w = 0$ because the term $\bar{x}_i(\hat{\mathbf{E}}'\mathbf{u}_i)$ was ignored. The axis label needs to be translated so that the value $\bar{x}_i$ is at the origin of the biplot. Since $\bar{x}_i$ is typically not an integer (or another nice number), an integer (or other nice number) closest to it can be chosen and the scale translated appropriately. Computer software simplifies this somewhat difficult task.

The scale allows us to visually interpolate the position of $x_i[e_{1i}, e_{2i}]'$ in the biplot. The scales predict the values of a variable rather than give its exact value, as they are based on a two-dimensional approximation.
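The placement of scale marks along a variable's axis can be sketched as follows. This is an illustration in Python with NumPy under stated assumptions: the function name `axis_tick_positions` and the small data matrix are hypothetical, and the mark for the value $t$ of variable $i$ is placed at $(t - \bar{x}_i)[e_{1i}, e_{2i}]'$, so that the value $\bar{x}_i$ falls at the origin of the biplot as described above.

```python
import numpy as np

def axis_tick_positions(E2, xbar, i, values):
    """Biplot positions of the scale marks for variable i.

    E2     : p x 2 matrix whose columns are the first two eigenvectors of S
    xbar   : length-p vector of sample means
    values : values of variable i to be marked on its axis
    """
    direction = E2[i]                                   # (e_1i, e_2i)
    offsets = np.asarray(values, dtype=float) - xbar[i]
    return offsets[:, None] * direction[None, :]

# Hypothetical data: 6 units measured on 3 variables.
X = np.array([[2.0, 10.0, 1.0],
              [4.0, 12.0, 3.0],
              [6.0, 11.0, 2.0],
              [8.0, 15.0, 5.0],
              [5.0, 13.0, 4.0],
              [7.0, 14.0, 6.0]])
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)        # eigh returns eigenvalues in ascending order
E2 = eigvecs[:, ::-1][:, :2]                # eigenvectors for the two largest eigenvalues

# Marks at the mean and one unit on either side of it.
ticks = axis_tick_positions(E2, xbar, 0, [xbar[0], xbar[0] + 1.0, xbar[0] - 1.0])

# Marks at integer ("nice") values covering the observed range of variable 0.
marks = np.arange(np.floor(X[:, 0].min()), np.ceil(X[:, 0].max()) + 1)
mark_positions = axis_tick_positions(E2, xbar, 0, marks)
print(ticks)
print(mark_positions)
```

Successive integer marks are separated by the vector $[e_{1i}, e_{2i}]'$, which is the distance in the biplot for a change of one unit in $x_i$.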


Example 12.20 (An alternative biplot for the university data) We illustrate this newer biplot with the university data in Table 12.9. The alternative biplot with an axis for each variable is shown in Figure 12.25. Compared with Figure 12.24, the software reversed the direction of the first principal component. Notice, for example, that expenses and student-faculty ratio separate Cal Tech and Johns Hopkins from the other universities. Expenses for Cal Tech and Johns Hopkins can be seen to be about 57 thousand a year, and the student-faculty ratios are in the single digits. The large state universities, on the right-hand side of the plot, have relatively high student-faculty ratios, above 20, relatively low SAT scores of entering freshmen, and only about 50% or fewer of their entering students in the top 10% of their high school class. The scaled axes on the newer biplot are more informative than the arrows in the original biplot. ■




Figure 12.25 An alternative biplot of the data on universities.


See le Roux and Gardner [23] for more examples of this alternative biplot and 
references to appropriate special purpose statistical software. 


12.9 Procrustes Analysis: A Method 
for Comparing Configurations 


Starting with a given n X n matrix of distances D, or similarities S, that relate n 
objects, two or more configurations can be obtained using different techniques. The 
possible methods include both metric and nonmetric multidimensional scaling. 
The question naturally arises as to how well the solutions coincide. Figures 12.19 and 
12.20 in Example 12.16 respectively give the metric multidimensional scaling 
(principal coordinate analysis) and nonmetric multidimensional scaling solutions 
for the data on universities. The two configurations appear to be quite similar, but a 
quantitative measure would be useful. A numerical comparison of two configura- 
tions, obtained by moving one configuration so that it aligns best with the other, is 
called Procrustes analysis, after the innkeeper Procrustes, in Greek mythology, who 
would either stretch or lop off customers’ limbs so they would fit his bed. 




Constructing the Procrustes Measure of Agreement 


Suppose the $n \times p$ matrix $\mathbf{X}^*$ contains the coordinates of the $n$ points obtained for plotting with technique 1 and the $n \times q$ matrix $\mathbf{Y}^*$ contains the coordinates from technique 2, where $q \le p$. By adding columns of zeros to $\mathbf{Y}^*$, if necessary, we can assume that $\mathbf{X}^*$ and $\mathbf{Y}^*$ both have the same dimension $n \times p$. To determine how compatible the two configurations are, we move, say, the second configuration to match the first by shifting each point by the same amount and rotating or reflecting the configuration about the coordinate axes.⁴

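Mathematically, we translate by a vector $\mathbf{b}$ and multiply by an orthogonal matrix $\mathbf{Q}$ so that the coordinates of the $j$th point $\mathbf{y}_j$ are transformed to

$$\mathbf{Q}\mathbf{y}_j + \mathbf{b}$$

The vector $\mathbf{b}$ and orthogonal matrix $\mathbf{Q}$ are then varied in order to minimize the sum, over all $n$ points, of squared distances

$$d_j^2(\mathbf{x}_j, \mathbf{Q}\mathbf{y}_j + \mathbf{b}) = (\mathbf{x}_j - \mathbf{Q}\mathbf{y}_j - \mathbf{b})'(\mathbf{x}_j - \mathbf{Q}\mathbf{y}_j - \mathbf{b}) \tag{12-48}$$

between $\mathbf{x}_j$ and the transformed coordinates $\mathbf{Q}\mathbf{y}_j + \mathbf{b}$ obtained for the second technique. We take, as a measure of fit, or agreement, between the two configurations, the residual sum of squares

$$PR^2 = \min_{\mathbf{Q}, \mathbf{b}} \sum_{j=1}^{n} (\mathbf{x}_j - \mathbf{Q}\mathbf{y}_j - \mathbf{b})'(\mathbf{x}_j - \mathbf{Q}\mathbf{y}_j - \mathbf{b}) \tag{12-49}$$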


The next result shows how to evaluate this Procrustes residual sum of squares mea- 
sure of agreement and determines the Procrustes rotation of Y* relative to X*. 


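Result 12.2 Let the $n \times p$ configurations $\mathbf{X}^*$ and $\mathbf{Y}^*$ both be centered so that all columns have mean zero. Then

$$PR^2 = \sum_{j=1}^{n} \mathbf{x}_j'\mathbf{x}_j + \sum_{j=1}^{n} \mathbf{y}_j'\mathbf{y}_j - 2\sum_{i=1}^{p} \lambda_i = \operatorname{tr}[\mathbf{X}^*\mathbf{X}^{*\prime}] + \operatorname{tr}[\mathbf{Y}^*\mathbf{Y}^{*\prime}] - 2\operatorname{tr}[\boldsymbol{\Lambda}] \tag{12-50}$$

where $\boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$ and the minimizing transformation is

$$\hat{\mathbf{Q}} = \sum_{i=1}^{p} \mathbf{v}_i\mathbf{u}_i' = \mathbf{V}\mathbf{U}' \qquad \hat{\mathbf{b}} = \mathbf{0} \tag{12-51}$$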


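⁴Sibson [30] has proposed a numerical measure of the agreement between two configurations, given by the coefficient

$$\gamma = 1 - \frac{\left[\operatorname{tr}(\mathbf{Y}^{*\prime}\mathbf{X}^*\mathbf{X}^{*\prime}\mathbf{Y}^*)^{1/2}\right]^2}{\operatorname{tr}(\mathbf{X}^{*\prime}\mathbf{X}^*)\operatorname{tr}(\mathbf{Y}^{*\prime}\mathbf{Y}^*)}$$

For identical configurations, $\gamma = 0$. If necessary, $\gamma$ can be computed after a Procrustes analysis has been completed.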




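Here $\boldsymbol{\Lambda}$, $\mathbf{U}$, and $\mathbf{V}$ are obtained from the singular-value decomposition

$$\sum_{j=1}^{n} \mathbf{y}_j\mathbf{x}_j' = \underset{(p \times n)}{\mathbf{Y}^{*\prime}}\ \underset{(n \times p)}{\mathbf{X}^*} = \underset{(p \times p)}{\mathbf{U}}\ \underset{(p \times p)}{\boldsymbol{\Lambda}}\ \underset{(p \times p)}{\mathbf{V}'}$$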


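Proof. Because the configurations are centered to have zero means ($\sum_{j=1}^{n} \mathbf{x}_j = \mathbf{0}$ and $\sum_{j=1}^{n} \mathbf{y}_j = \mathbf{0}$), we have

$$\sum_{j=1}^{n} (\mathbf{x}_j - \mathbf{Q}\mathbf{y}_j - \mathbf{b})'(\mathbf{x}_j - \mathbf{Q}\mathbf{y}_j - \mathbf{b}) = \sum_{j=1}^{n} (\mathbf{x}_j - \mathbf{Q}\mathbf{y}_j)'(\mathbf{x}_j - \mathbf{Q}\mathbf{y}_j) + n\mathbf{b}'\mathbf{b}$$

The last term is nonnegative, so the best fit occurs for $\hat{\mathbf{b}} = \mathbf{0}$. Consequently, we need only consider

$$\min_{\mathbf{Q}} \sum_{j=1}^{n} (\mathbf{x}_j - \mathbf{Q}\mathbf{y}_j)'(\mathbf{x}_j - \mathbf{Q}\mathbf{y}_j) = \sum_{j=1}^{n} \mathbf{x}_j'\mathbf{x}_j + \sum_{j=1}^{n} \mathbf{y}_j'\mathbf{y}_j - 2\max_{\mathbf{Q}} \sum_{j=1}^{n} \mathbf{x}_j'\mathbf{Q}\mathbf{y}_j$$

Using $\mathbf{x}_j'\mathbf{Q}\mathbf{y}_j = \operatorname{tr}[\mathbf{Q}\mathbf{y}_j\mathbf{x}_j']$, we find that the expression being maximized becomes

$$\sum_{j=1}^{n} \mathbf{x}_j'\mathbf{Q}\mathbf{y}_j = \sum_{j=1}^{n} \operatorname{tr}[\mathbf{Q}\mathbf{y}_j\mathbf{x}_j'] = \operatorname{tr}\left[\mathbf{Q}\sum_{j=1}^{n} \mathbf{y}_j\mathbf{x}_j'\right]$$

By the singular-value decomposition,

$$\sum_{j=1}^{n} \mathbf{y}_j\mathbf{x}_j' = \mathbf{Y}^{*\prime}\mathbf{X}^* = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}' = \sum_{i=1}^{p} \lambda_i\mathbf{u}_i\mathbf{v}_i'$$

where $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_p]$ and $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_p]$ are $p \times p$ orthogonal matrices. Consequently,

$$\sum_{j=1}^{n} \mathbf{x}_j'\mathbf{Q}\mathbf{y}_j = \operatorname{tr}\left[\mathbf{Q}\left(\sum_{i=1}^{p} \lambda_i\mathbf{u}_i\mathbf{v}_i'\right)\right] = \sum_{i=1}^{p} \lambda_i\operatorname{tr}[\mathbf{Q}\mathbf{u}_i\mathbf{v}_i']$$

The variable quantity in the $i$th term

$$\operatorname{tr}[\mathbf{Q}\mathbf{u}_i\mathbf{v}_i'] = \mathbf{v}_i'\mathbf{Q}\mathbf{u}_i$$

has an upper bound of 1, as can be seen by applying the Cauchy-Schwarz inequality (2-48) with $\mathbf{b} = \mathbf{Q}'\mathbf{v}_i$ and $\mathbf{d} = \mathbf{u}_i$. That is, since $\mathbf{Q}$ is orthogonal,

$$\mathbf{v}_i'\mathbf{Q}\mathbf{u}_i \le \sqrt{\mathbf{v}_i'\mathbf{Q}\mathbf{Q}'\mathbf{v}_i}\ \sqrt{\mathbf{u}_i'\mathbf{u}_i} = \sqrt{\mathbf{v}_i'\mathbf{v}_i} \times 1 = 1$$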




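Each of these $p$ terms can be maximized by the same choice $\mathbf{Q} = \mathbf{V}\mathbf{U}'$. With this choice,

$$\mathbf{v}_i'\mathbf{Q}\mathbf{u}_i = \mathbf{v}_i'\mathbf{V}\mathbf{U}'\mathbf{u}_i = [0, \ldots, 0, 1, 0, \ldots, 0] \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = 1$$

Therefore,

$$-2\max_{\mathbf{Q}} \sum_{j=1}^{n} \mathbf{x}_j'\mathbf{Q}\mathbf{y}_j = -2(\lambda_1 + \lambda_2 + \cdots + \lambda_p)$$

Finally, we verify that $\mathbf{Q}\mathbf{Q}' = \mathbf{V}\mathbf{U}'\mathbf{U}\mathbf{V}' = \mathbf{V}\mathbf{I}_p\mathbf{V}' = \mathbf{I}_p$, so $\mathbf{Q}$ is a $p \times p$ orthogonal matrix, as required. ■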
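Result 12.2 can be turned into a short routine. The following is a minimal sketch in Python with NumPy; the function name `procrustes` and the simulated configurations are assumptions made for illustration. Rows of the coordinate matrices are the points, so applying $\mathbf{Q}$ to every point $\mathbf{y}_j$ corresponds to the matrix product `Y @ Q.T`.

```python
import numpy as np

def procrustes(X, Y):
    """Procrustes rotation of Y toward X, both n x p with one point per row.

    Returns the orthogonal matrix Q = V U' and the residual sum of squares PR^2.
    Both configurations are centered first, so the optimal shift is zero.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, lam, Vt = np.linalg.svd(Yc.T @ Xc)        # Y*'X* = U Lambda V'
    Q = Vt.T @ U.T                               # Q = V U'
    pr2 = np.sum(Xc**2) + np.sum(Yc**2) - 2.0 * lam.sum()
    return Q, pr2

# Hypothetical configuration X and a rotated, shifted copy Y with y_j = R x_j + shift.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
theta = np.radians(25.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
Y = X @ R.T + np.array([3.0, -1.0])

Q, pr2 = procrustes(X, Y)                        # Q undoes the rotation, PR^2 is near 0

# A noisy copy: PR^2 equals the directly computed residual sum of squares.
Y_noisy = Y + 0.1 * rng.normal(size=X.shape)
Q2, pr2_noisy = procrustes(X, Y_noisy)
Xc = X - X.mean(axis=0)
Yc = Y_noisy - Y_noisy.mean(axis=0)
direct = np.sum((Xc - Yc @ Q2.T) ** 2)
print(pr2, pr2_noisy, direct)
```

Because trace terms such as $\operatorname{tr}[\mathbf{X}^*\mathbf{X}^{*\prime}]$ are just sums of squared entries, the sketch uses `np.sum(Xc**2)` rather than forming the $n \times n$ product.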


Example 12.21 (Procrustes analysis of the data on universities) Two configurations, produced by metric and nonmetric multidimensional scaling, of data on universities are given in Example 12.16. The two configurations appear to be quite close. There is a two-dimensional array of coordinates for each of the two scaling methods. Initially, the sum of squared distances is

$$\sum_{j=1}^{25} (\mathbf{x}_j - \mathbf{y}_j)'(\mathbf{x}_j - \mathbf{y}_j) = 3.862$$

A computer calculation gives

$$\mathbf{U} = \begin{bmatrix} -.9990 & .0448 \\ .0448 & .9990 \end{bmatrix} \qquad \mathbf{V} = \begin{bmatrix} -1.0000 & .0076 \\ .0076 & 1.0000 \end{bmatrix} \qquad \boldsymbol{\Lambda} = \begin{bmatrix} 114.9439 & 0.000 \\ 0.000 & 21.3673 \end{bmatrix}$$

According to Result 12.2, to better align these two solutions, we multiply the nonmetric scaling solution by the orthogonal matrix

$$\hat{\mathbf{Q}} = \sum_{i=1}^{2} \mathbf{v}_i\mathbf{u}_i' = \mathbf{V}\mathbf{U}' = \begin{bmatrix} .9993 & -.0372 \\ .0372 & .9993 \end{bmatrix}$$

This corresponds to clockwise rotation of the nonmetric solution by about 2 degrees. After rotation, the sum of squared distances, 3.862, is reduced to the Procrustes measure of fit

$$PR^2 = \sum_{j=1}^{25} \mathbf{x}_j'\mathbf{x}_j + \sum_{j=1}^{25} \mathbf{y}_j'\mathbf{y}_j - 2\sum_{i=1}^{2} \lambda_i = 3.673$$
j=l j=1 j=l 




Example 12.22 (Procrustes analysis and additional ordinations of data on forests) 
Data were collected on the populations of eight species of trees growing on ten 
upland sites in southern Wisconsin. These data are shown in Table 12.10. 

The metric, or principal coordinate, solution and nonmetric multidimensional 
scaling solution are shown in Figures 12.26 and 12.27. 


Table 12.10 Wisconsin Forest Data

Rows (tree species): BurOak, BlackOak, WhiteOak, RedOak, AmericanElm, Basswood, Ironwood, SugarMaple
Columns: Site 1, 2, ..., 10

Source: See [24].


Figure 12.26 Metric multidimensional scaling of the data on forests.

Figure 12.27 Nonmetric multidimensional scaling of the data on forests.


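Using the coordinates of the points in Figures 12.26 and 12.27, we obtain the initial sum of squared distances for fit:

$$\sum_{j=1}^{10} (\mathbf{x}_j - \mathbf{y}_j)'(\mathbf{x}_j - \mathbf{y}_j) = 8.547$$

A computer calculation gives

$$\mathbf{U} = \begin{bmatrix} -.9833 & -.1821 \\ -.1821 & .9833 \end{bmatrix} \qquad \mathbf{V} = \begin{bmatrix} -1.0000 & -.0001 \\ -.0001 & 1.0000 \end{bmatrix} \qquad \boldsymbol{\Lambda} = \begin{bmatrix} 43.3748 & 0.0000 \\ 0.0000 & 14.9103 \end{bmatrix}$$

According to Result 12.2, to better align these two solutions, we multiply the nonmetric scaling solution by the orthogonal matrix

$$\hat{\mathbf{Q}} = \sum_{i=1}^{2} \mathbf{v}_i\mathbf{u}_i' = \mathbf{V}\mathbf{U}' = \begin{bmatrix} .9833 & .1821 \\ -.1821 & .9833 \end{bmatrix}$$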




This corresponds to clockwise rotation of the nonmetric solution by about 10 degrees. After rotation, the sum of squared distances, 8.547, is reduced to the Procrustes measure of fit

$$PR^2 = \sum_{j=1}^{10} \mathbf{x}_j'\mathbf{x}_j + \sum_{j=1}^{10} \mathbf{y}_j'\mathbf{y}_j - 2\sum_{i=1}^{2} \lambda_i = 6.599$$


We note that the sampling sites seem to fall along a curve in both pictures. This 
could lead to a one-dimensional nonlinear ordination of the data. A quadratic or 
other curve could be fit to the points. By adding a scale to the curve, we would 
obtain a one-dimensional ordination. 

It is informative to view the Wisconsin forest data when both sampling units and 
variables are shown. A correspondence analysis applied to the data produces the 
plot in Figure 12.28. The biplot is shown in Figure 12.29. 

All of the plots tell similar stories. Sites 1-5 tend to be associated with species of 
oak trees, while sites 7-10 tend to be associated with basswood, ironwood, and sugar 
maples. American elm trees are distributed over most sites, but are more closely 
associated with the lower numbered sites. There is almost a continuum of sites 
distinguished by the different species of trees. ■


Figure 12.28 The correspondence analysis plot of the data on forests.

Figure 12.29 The biplot of the data on forests.


Supplement 


DATA MINING 


Introduction 


A very large sample in applications of traditional statistical methodology may mean 
10,000 observations on, perhaps, 50 variables. Today, computer-based repositories 
known as data warehouses may contain many terabytes of data. For some organiza- 
tions, corporate data have grown by a factor of 100,000 or more over the last few 
decades. The telecommunications, banking, pharmaceutical, and (package) shipping 
industries provide several examples of companies with huge databases. Consider the 
following illustration. If each of the approximately 17 million books in the Library 
of Congress contained a megabyte of text (roughly 450 pages) in MS Word format, 
then typing this collection of printed material into a computer database would con- 
sume about 17 terabytes of disk space. United Parcel Service (UPS) has a package- 
level detail database of about 17 terabytes to track its shipments. 

For our purposes, data mining refers to the process associated with discovering 
patterns and relationships in extremely large data sets. That is, data mining is 
concerned with extracting a few nuggets of knowledge from a relative mountain of 
numerical information. From a business perspective, the nuggets of knowledge rep- 
resent actionable information that can be exploited for a competitive advantage. 

Data mining is not possible without appropriate software and fast computers. Not 
surprisingly, many of the techniques discussed in this book, along with algorithms de- 
veloped in the machine learning and artificial intelligence fields, play important roles 
in data mining. Companies with well-known statistical software packages now offer
comprehensive data mining programs.⁵ In addition, special purpose programs such as
CART have been used successfully in data mining applications. 

Data mining has helped to identify new chemical compounds for prescription 
drugs, detect fraudulent claims and purchases, create and maintain individual 
customer relationships, design better engines and build appropriate inventories, 
create better medical procedures, improve process control, and develop effective 
credit scoring rules. 


⁵SAS Institute's data mining program is currently called Enterprise Miner. SPSS's data mining program is Clementine.




In traditional statistical applications, sample sizes are relatively small, data are 
carefully collected, sample results provide a basis for inference, anomalies are 
treated but are often not of immediate interest, and models are frequently highly 
structured. In data mining, sample sizes can be huge; data are scattered and historical (routinely recorded); samples are used for training, validation, and testing (no formal inference); anomalies are of interest; and models are often unstructured.
Moreover, data preparation—including data collection, assessment and cleaning, 
and variable definition and selection—is typically an arduous task and represents 60 
to 80% of the data mining effort. 

Data mining problems can be roughly classified into the following categories: 


• Classification (discrete outcomes):
  Who is likely to move to another cellular phone service?
• Prediction (continuous outcomes):
  What is the appropriate appraised value for this house?
• Association/market basket analysis:
  Is skim milk typically purchased with low-fat cottage cheese?
• Clustering:
  Are there groups with similar buying habits?
• Description:
  On Thursdays, grocery store consumers often purchase corn chips and soft
  drinks together.


Given the nature of data mining problems, it should not be surprising that many of 
the statistical methods discussed in this book are part of comprehensive data mining 
software packages. Specifically, regression, discrimination and classification proce- 
dures (linear rules, logistic regression, decision trees such as those produced by 
CART), and clustering algorithms are important data mining tools. Other tools, 
whose discussion is beyond the scope of this book, include association rules, multi- 
variate adaptive regression splines (MARS), K-nearest neighbor algorithm, neural 
networks, genetic algorithms, and visualization.⁶


The Data Mining Process 


Data mining is a process requiring a sequence of steps. The steps form a strategy 
that is not unlike the strategy associated with any model building effort. Specifically, 
data miners must 

1. Define the problem and identify objectives. 

2. Gather and prepare the appropriate data. 


3. Explore the data for suspected associations, unanticipated characteristics, and 
obvious anomalies to gain understanding. 


4. Clean the data and perform any variable transformation that seems appropriate. 


⁶For more information on data mining in general and data mining tools in particular, see the
references at the end of this chapter.


742 Chapter 12 Clustering, Distance Methods, and Ordination 


5. Divide the data into training, validation, and, perhaps, test data sets. 
6. Build the model on the training set. 
7. Modify the model (if necessary) based on its performance with the validation data. 


8. Assess the model by checking its performance on validation or test data. 
Compare the model outcomes with the initial objectives. Is the model likely to 
be useful? 


9. Use the model. 
10. Monitor the model performance. Are the results reliable, cost effective? 


In practice, it is typically necessary to repeat one or more of these steps several
times until a satisfactory solution is achieved. Data mining software suites such as
Enterprise Miner and Clementine are typically organized so that the user can work
sequentially through the steps listed and, in fact, can picture them on the screen as a
process flow diagram.

Data mining requires a rich collection of tools and algorithms used by a skilled 
analyst with sound subject matter knowledge (or working with someone with sound 
subject matter knowledge) to produce acceptable results. Once established, any suc- 
cessful data mining effort is an ongoing exercise. New data must be collected and 
processed, the model must be updated or a new model developed, and, in general, 
adjustments made in light of new experience. The cost of a poor data mining effort
is high, so careful model construction and evaluation are imperative.


Model Assessment 


In the model development stage of data mining, several models may be examined 
simultaneously. In the example to follow, we briefly discuss the results of applying 
logistic regression, decision tree methodology, and a neural network to the problem 
of credit scoring (determining good credit risks) using a publicly available data set 
known as the German Credit data. Although the data miner can control the model 
inputs and certain parameters that govern the development of individual models, in 
most data mining applications there is little formal statistical inference. Models are 
ordinarily assessed (and compared) by domain experts using descriptive devices 
such as confusion matrices, summary profit or loss numbers, lift charts, threshold 
charts, and other, mostly graphical, procedures. 

The split of the very large initial data set into training, validation, and testing
subsets allows potential models to be assessed with data that were not involved in
model development. Thus, the training set is used to build models that are assessed 
on the validation (holdout) data set. If a model does not perform satisfactorily in the 
validation phase, it is retrained. Iteration between training and validation continues 
until satisfactory performance with validation data is achieved. At this point, a 
trained and validated model is assessed with test data. The test data set is ordinarily 
used once at the end of the modeling process to ensure an unbiased assessment of 
model performance. On occasion, the test data step is omitted and the final assess- 
ment is done with the validation sample, or by cross-validation. 

An important assessment tool is the lift chart. Lift charts may be formatted in
various ways, but all indicate improvement of the selected procedures (models) over 
what can be achieved by a baseline activity. The baseline activity often represents a
prior conviction or a random assignment. Lift charts are particularly useful for
comparing the performance of different models. 
Lift is defined as

$$\text{Lift} = \frac{P(\text{result} \mid \text{condition})}{P(\text{result})}$$


If the result is independent of the condition, then Lift = 1. A value of Lift > 1 
implies the condition (generally a model or algorithm) leads to a greater probabili- 
ty of the desired result and, hence, the condition is useful and potentially profitable. 
Different conditions can be compared by comparing their lift charts. 
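
The definition of lift can be checked numerically. The following is a minimal sketch, assuming the two probabilities are estimated by simple relative frequencies; the function name `lift` and all of the counts are hypothetical and are not taken from the text.

```python
def lift(n_both, n_condition, n_result, n_total):
    """Lift = P(result | condition) / P(result), estimated from counts."""
    p_result_given_condition = n_both / n_condition
    p_result = n_result / n_total
    return p_result_given_condition / p_result


# Hypothetical counts (not from the text): 1000 customers, 200 of whom respond.
# A model flags 100 customers, and 50 of those respond.
print(lift(n_both=50, n_condition=100, n_result=200, n_total=1000))

# If the flagged group responds at the overall rate, the result behaves as if
# it were independent of the condition and the lift equals 1.
print(lift(n_both=20, n_condition=100, n_result=200, n_total=1000))
```

In the first call the flagged group responds at a rate of .5 against an overall rate of .2, so the condition is useful in the sense described above. In the second call the two rates agree and nothing is gained.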


Example 12.23 (A small-scale data mining exercise) A publicly available data set 
known as the German Credit data⁷ contains observations on 20 variables for 1000
past applicants for credit. In addition, the resulting credit rating (“Good” or “Bad”) 
for each applicant was recorded. The objective is to develop a credit scoring rule 
that can be used to determine if a new applicant is a good credit risk or a bad 
credit risk based on values for one or more of the 20 explanatory variables. 
The 20 explanatory variables include CHECKING (checking account status), 
DURATION (duration of credit in months), HISTORY (credit history), AMOUNT 
(credit amount), EMPLOYED (present employment since), RESIDENT (present 
resident since), AGE (age in years), OTHER (other installment debts), INSTALLP 
(installment rate as % of disposable income), and so forth. Essentially, then, we 
must develop a function of several variables that allows us to classify a new appli- 
cant into one of two categories: Good or Bad. 

We will develop a classification procedure using three approaches discussed in
Sections 11.7 and 11.8: logistic regression, classification trees, and neural networks.
An abbreviated assessment of the three approaches will allow us to compare their
performance on a validation data set. This data mining exercise
is implemented using the general data mining process described earlier and SAS
Enterprise Miner software.

In the full credit data set, 70% of the applicants were Good credit risks and 30% 
of the applicants were Bad credit risks. The initial data were divided into two sets for 
our purposes, a training set and a validation set. About 60% of the data (581 cases) 
were allocated to the training set and about 40% of the data (419 cases) were allo- 
cated to the validation set. The random sampling scheme employed ensured that 
each of the training and validation sets contained about 70% Good applicants and 
about 30% Bad applicants. The applicant credit risk profiles for the data sets follow. 


            Credit data   Training data   Validation data
Good:           700            401              299
Bad:            300            180              120
Total:         1000            581              419
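
A split that preserves the 70%/30% class mix can be sketched as follows. This is only an illustration under stated assumptions: the function `stratified_split` is hypothetical, it samples exactly 60% within each class, and it is not the sampling scheme used by Enterprise Miner, which produced the 581/419 split shown above.

```python
import random


def stratified_split(labels, train_fraction=0.6, seed=0):
    """Return (train_idx, valid_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, valid_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                          # random order within the class
        cut = round(train_fraction * len(idx))    # cases of this class sent to training
        train_idx.extend(idx[:cut])
        valid_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(valid_idx)


# Class counts from the full credit data set: 700 Good and 300 Bad applicants.
labels = ["Good"] * 700 + ["Bad"] * 300
train, valid = stratified_split(labels)
for name, idx in (("training", train), ("validation", valid)):
    n_good = sum(labels[i] == "Good" for i in idx)
    print(name, len(idx), "cases,", n_good, "Good,", len(idx) - n_good, "Bad")
```

Because the sampling is done within each class, both subsets keep the same proportion of Good applicants as the full data set, which is the property the random sampling scheme in the example was chosen to ensure.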


⁷At the time this supplement was written, the German Credit data were available in a sample data
file accompanying SAS Enterprise Miner. Many other publicly available data sets can be downloaded
from the following Web site: www.kdnuggets.com.




[Figure: Enterprise Miner process flow diagram. Legible icon labels include SAMPS10.DMAGECR,
Neural Network, Distribution Explorer, and SAMPS10.DMAGESCR.]

Figure 12.30 The process flow diagram.


Figure 12.30 shows the process flow diagram from the Enterprise Miner screen. 
The icons in the figure represent various activities in the data mining process. As 
examples, SAMPS10.DMAGECR contains the data; Data Partition allows the data 
to be split into training, validation, and testing subsets; Transform Variables, as the 
name implies, allows one to make variable transformations; the Regression, Tree, 
and Neural Network icons can each be opened to develop the individual models; 
and Assessment allows an evaluation of each predictive model in terms of predictive
power, lift, profit or loss, and so on, and a comparison of all models.

The best model (with the training set parameters) can be used to score a new
selection of applicants without a credit designation (SAMPS10.DMAGESCR). The
results of this scoring can be displayed, in various ways, with Distribution Explorer.

For this example, the prior probabilities were set proportional to the data; con- 
sequently, P(Good) = .7 and P(Bad) = .3. The cost matrix was initially specified 
as follows: 


                         Predicted (Decision)
                   Good (Accept)    Bad (Reject)
Actual   Good            0               $1
         Bad            $5                0


so that it is 5 times as costly to classify a Bad applicant as Good (Accept) as it is to 
classify a Good applicant as Bad (Reject). In practice, accepting a Good credit risk 
should result in a profit or, equivalently, a negative cost. To match this formulation 
more closely, we subtract $1 from the entries in the first row of the cost matrix to 
obtain the “realistic” cost matrix: 


                         Predicted (Decision)
                   Good (Accept)    Bad (Reject)
Actual   Good          -$1                0
         Bad            $5                0




This matrix yields the same decisions as the original cost matrix, but the results are 
easier to interpret relative to the expected cost objective function. For example, 
after further adjustments, a negative expected cost score may indicate a potential 
profit so the applicant would be a Good credit risk. 
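
To see why the two cost matrices lead to the same decisions, the sketch below computes the expected cost of each decision for a given probability that an applicant is Good. It assumes that such a probability, here called `p_good`, is supplied by some fitted model; the names `expected_costs` and `decide` are hypothetical, and the further adjustments made by Enterprise Miner are not reproduced.

```python
def expected_costs(p_good, cost):
    """Expected cost of each decision; cost[actual][decision] is in dollars."""
    p = {"Good": p_good, "Bad": 1.0 - p_good}
    return {d: sum(p[a] * cost[a][d] for a in p) for d in ("Accept", "Reject")}


def decide(p_good, cost):
    """Choose the decision with the smaller expected cost."""
    ec = expected_costs(p_good, cost)
    return min(ec, key=ec.get)


original = {"Good": {"Accept": 0, "Reject": 1}, "Bad": {"Accept": 5, "Reject": 0}}
realistic = {"Good": {"Accept": -1, "Reject": 0}, "Bad": {"Accept": 5, "Reject": 0}}

for p_good in (0.5, 0.8, 0.9, 0.99):
    print(p_good,
          expected_costs(p_good, realistic),
          decide(p_good, original),
          decide(p_good, realistic))
```

Subtracting $1 from the Good row lowers the expected cost of both decisions by the same amount, P(Good), so the ordering of the two decisions is unchanged. With the realistic matrix the expected cost of rejecting is always 0, so an applicant is accepted exactly when the expected cost of accepting is negative.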

Next, input variables need to be processed (perhaps transformed), models (or 
algorithms) must be specified, and required parameters must be set in all of the icons in 
the process flow diagram. Then the process can be executed up to any point in the dia- 
gram by clicking on an icon. All previous connected icons are run. For example, clicking 
on Score executes the process up to and including the Score icon. Results associated 
with individual icons can then be examined by clicking on the appropriate icon. 

We illustrate model assessment using lift charts. These lift charts, available in 
the Assessment icon, result from one execution of the process flow diagram in 
Figure 12.30. 

Consider the logistic regression classifier. Using the logistic regression function
determined with the training data, an expected cost can be computed for each case
in the validation set. These expected cost “scores” can then be ordered from smallest to
largest and partitioned into groups by the 10th, 20th, ..., and 90th percentiles. The
first percentile group then contains the 42 applicants (10% of 419) with the
smallest (most negative) expected costs (largest potential profits), the second percentile
group contains the next 42 applicants (next 10%), and so on. (From a classification
viewpoint, those applicants with negative expected costs might be classified as Good
risks and those with nonnegative expected costs as Bad risks.)

If the model has no predictive power, we would expect, approximately, a uni- 
form distribution of, say, Good credit risks over the percentile groups. That is, we 
would expect 10% or .10(299) = 30 Good credit risks among the 42 applicants in 
each of the percentile groups. 

Once the validation data have been scored, we can count the number of Good 
credit risks (of the 42 applicants) actually falling in each percentile group. For 
example, of the 42 applicants in the first percentile group, 40 were actually Good 
risks for a “captured response rate” of 40/299 = .133 or 13.3%. In this case, lift for 
the first percentile group can be calculated as the ratio of the number of Good 
predicted by the model to the number of Good from a random assignment or 


$$\text{Lift} = \frac{40}{30} = 1.33$$


The lift value indicates the model assigns 10/299 = .033 or 3.3% more Good risks 
to the first percentile group (largest negative expected cost) than would be assigned 
by chance. 
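
The arithmetic for the first percentile group can be reproduced directly from the counts quoted above. The sketch below also shows the effect of rounding the chance baseline to 30; the variable names are ours.

```python
n_validation = 419        # applicants in the validation set
n_good = 299              # Good credit risks in the validation set
group_size = 42           # applicants in one percentile group
good_in_first_group = 40  # Good risks observed in the first percentile group

# Number of Good risks expected in a group of 42 under a random assignment.
expected_good = group_size * n_good / n_validation
lift_unrounded = good_in_first_group / expected_good
lift_rounded = good_in_first_group / 30   # baseline rounded to 30, as in the text

print(f"expected Good by chance: {expected_good:.2f}")
print(f"lift with unrounded baseline: {lift_unrounded:.3f}")
print(f"lift with baseline rounded to 30: {lift_rounded:.3f}")
```

The two versions of the lift agree to two decimal places, so the rounding of the baseline has little effect here.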

Lift statistics can be displayed as individual (noncumulative) values or as cumulative
values. For example, 40 Good risks also occur in the second percentile group
for the logistic regression classifier, and the cumulative lift for the first two
percentile groups is

$$\text{Lift} = \frac{40 + 40}{30 + 30} = 1.33$$

8 The lift numbers calculated here differ a bit from the numbers displayed in the lift diagrams to fol- 
low because of rounding. 




[Figure: cumulative lift plotted against percentile (20 to 100) for the Baseline and Reg tools.]

Figure 12.31 Cumulative lift chart for the logistic regression classifier.


The cumulative lift chart for the logistic regression model is displayed in Figure 12.31. 

Lift and cumulative lift statistics can be determined for the classification tree 
tool and for the neural network tool. For each classifier, the entire data set is scored 
(expected costs computed), applicants ordered from smallest score to largest score 
and percentile groups created. At this point, the lift calculations follow those out- 
lined for the logistic regression method. The cumulative charts for all three classi- 
fiers are shown in Figure 12.32. 
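
The scoring, ordering, and grouping steps just described can be sketched for any classifier that produces a score for each applicant. The function `lift_table` below is a hypothetical helper, and the 20 scores are toy values chosen so that the result is easy to check by hand; they are not the German Credit data.

```python
def lift_table(scores, is_good, n_groups=10):
    """Noncumulative and cumulative lift for groups ordered by increasing score.

    Assumes at least n_groups cases, so that no group is empty.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    base_rate = sum(is_good) / n          # overall proportion of Good risks
    rows, cum_good, cum_n = [], 0, 0
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        good = sum(is_good[i] for i in members)
        cum_good += good
        cum_n += len(members)
        rows.append({
            "group": g + 1,
            "lift": (good / len(members)) / base_rate,
            "cumulative_lift": (cum_good / cum_n) / base_rate,
        })
    return rows


# Toy data: 20 applicants; the 10 with the smallest scores are the Good risks.
scores = list(range(20))
is_good = [1] * 10 + [0] * 10
for row in lift_table(scores, is_good, n_groups=4):
    print(row)
```

Whatever the classifier, the cumulative lift for the last group equals 1, because by then every case has been included. Classifiers are therefore compared by how far their cumulative lift stays above 1 in the early groups.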


Figure 12.32 Cumulative lift 
charts for neural network, 
classification tree, and logistic 
regression tools. 


We see from Figure 12.32 that the neural network and the logistic regression
have very similar predictive powers and they both do better, in this case, than the
classification tree. The classification tree, in turn, outperforms a random assignment.
If this represented the end of the model building and assessment effort, one model
would be picked (say, the neural network) to score a new set of applicants (without
a credit risk designation) as Good (accept) or Bad (reject).

In the decision flow diagram in Figure 12.30, the SAMPS10.DMAGESCR file
contains 75 new applicants. Expected cost scores for these applicants were created
using the neural network model. Of the 75 applicants, 33 were classified as Good
credit risks (with negative expected costs). ■


Data mining procedures and software continue to evolve, and it is difficult to
predict what the future might bring. Database packages with embedded data mining
capabilities, such as SQL Server 2005, represent one evolutionary direction.


Exercises


12.1. Certain characteristics associated with a few recent U.S. presidents are listed in Table 12.11.


Table 12.11

                 Birthplace                                Prior U.S.
                 (region of        Elected                 congressional   Served as
President        United States)    first term?  Party      experience?     vice president?
1. R. Reagan     Midwest           Yes          Republican No              No
2. J. Carter     South             Yes          Democrat   No              No
3. G. Ford       Midwest           No           Republican Yes             Yes
4. R. Nixon      West              Yes          Republican Yes             Yes
5. L. Johnson    South             No           Democrat   Yes             Yes
6. J. Kennedy    East              Yes          Democrat   Yes             No

(a) Introducing appropriate binary variables, calculate similarity coefficient 1 in
Table 12.1 for pairs of presidents.
Hint: You may use birthplace as South, non-South.
(b) Proceeding as in Part a, calculate similarity coefficients 2 and 3 in Table 12.1. Verify
the monotonicity relation of coefficients 1, 2, and 3 by displaying the order of the 15
similarities for each coefficient.

12.2. Repeat Exercise 12.1 using similarity coefficients 5, 6, and 7 in Table 12.1.
12.3. Show that the sample correlation coefficient [see (12-11)] can be written as

$$r = \frac{ad - bc}{[(a + b)(a + c)(b + d)(c + d)]^{1/2}}$$


for two 0-1 binary variables with the following frequencies: 


                  Variable 2
                   0     1
Variable 1    0    a     b
              1    c     d


12.4. Show that the monotonicity property holds for the similarity coefficients 1, 2, and 3 in
Table 12.1.
Hint: (b + c) = p - (a + d). So, for instance,

$$\frac{a + d}{a + d + 2(b + c)} = \frac{1}{1 + 2[p/(a + d) - 1]}$$

This equation relates coefficients 3 and 1. Find analogous representations for the other
pairs.

12.5. Consider the matrix of distances

          1    2    3    4
     1    0
     2    1    0
     3   11    2    0
     4    5    3    4    0

Cluster the four items using each of the following procedures.
(a) Single linkage hierarchical procedure.
(b) Complete linkage hierarchical procedure.
(c) Average linkage hierarchical procedure.
Draw the dendrograms and compare the results in (a), (b), and (c).

12.6. The distances between pairs of five items are as follows:

          1    2    3    4    5
     1    0
     2    4    0
     3    6    9    0
     4    1    7   10    0
     5    6    3    5    8    0

Cluster the five items using the single linkage, complete linkage, and average linkage
hierarchical methods. Draw the dendrograms and compare the results.

12.7. Sample correlations for five stocks were given in Example 8.5. These correlations,
rounded to two decimal places, are reproduced as follows:

                      JP                  Wells    Royal        Exxon
                      Morgan   Citibank   Fargo    DutchShell   Mobil
JP Morgan              1
Citibank              .63       1
Wells Fargo           .51      .57         1
Royal DutchShell      .12      .32        .18       1
ExxonMobil            .16      .21        .15      .68           1

Treating the sample correlations as similarity measures, cluster the stocks using the
single linkage and complete linkage hierarchical procedures. Draw the dendrograms and
compare the results.

12.8. Using the distances in Example 12.3, cluster the items using the average linkage
hierarchical procedure. Draw the dendrogram. Compare the results with those in
Examples 12.3 and 12.5.




12.9. The vocabulary “richness” of a text can be quantitatively described by counting the
words used once, the words used twice, and so forth. Based on these counts, a linguist
proposed the following distances between chapters of the Old Testament book
Lamentations (data courtesy of Y. T. Radday and M. A. Pollatschek):

                           Lamentations chapter
                        1       2       3       4       5
                  1     0
Lamentations      2    .76      0
chapter           3   2.97     .80      0
                  4   4.88    4.17     .21      0
                  5   3.86    1.92    1.51     .51      0

Cluster the chapters of Lamentations using the three linkage hierarchical methods we
have discussed. Draw the dendrograms and compare the results.


12.10. Use Ward’s method to cluster the four items whose measurements on a single variable X
are given in the following table.

            Measurements
     Item        x
      1          2
      2          1
      3          5
      4          8

(a) Initially, each item is a cluster and we have the clusters

        {1}  {2}  {3}  {4}

Show that ESS = 0, as it must.
(b) If we join clusters {1} and {2}, the new cluster {12} has

$$\text{ESS}_1 = \sum_j (x_j - \bar{x})^2 = (2 - 1.5)^2 + (1 - 1.5)^2 = .5$$

and the ESS associated with the grouping {12}, {3}, {4} is ESS = .5 + 0 + 0 = .5.
The increase in ESS (loss of information) from the first step to the
current step is .5 - 0 = .5. Complete the following table by determining the
increase in ESS for all the possibilities at step 2.

        Clusters        Increase in ESS

(c) Complete the last two amalgamation steps, and construct the dendrogram showing the
values of ESS at which the mergers take place.


12.11. Suppose we measure two variables X1 and X2 for four items A, B, C, and D. The data are
as follows:

Observations

Use the K-means clustering technique to divide the items into K = 2 clusters. Start with
the initial groups (AB) and (CD).

12.12. Repeat Example 12.11, starting with the initial groups (AC) and (BD). Compare your
solution with the solution in the example. Are they the same? Graph the items in terms
of their (x1, x2) coordinates, and comment on the solutions.

12.13. Repeat Example 12.11, but start at the bottom of the list of items, and proceed up in the
order D, C, B, A. Begin with the initial groups (AB) and (CD). [The first potential
reassignment will be based on the distances d^2(D, (AB)) and d^2(D, (CD)).] Compare your
solution with the solution in the example. Are they the same? Should they be the same?


The following exercises require the use of a computer.

12.14. Table 11.9 lists measurements on 8 variables for 43 breakfast cereals.
(a) Using the data in the table, calculate the Euclidean distances between pairs of cereal
brands.
(b) Treating the distances calculated in (a) as measures of (dis)similarity, cluster the
cereals using the single linkage and complete linkage hierarchical procedures.
Construct dendrograms and compare the results.

12.15. Input the data in Table 11.9 into a K-means clustering program. Cluster the cereals into
K = 2, 3, and 4 groups. Compare the results with those in Exercise 12.14.

12.16. The national track records data for women are given in Table 1.9.
(a) Using the data in Table 1.9, calculate the Euclidean distances between pairs of
countries.
(b) Treating the distances in (a) as measures of (dis)similarity, cluster the countries using
the single linkage and complete linkage hierarchical procedures. Construct
dendrograms and compare the results.
(c) Input the data in Table 1.9 into a K-means clustering program. Cluster the countries
into groups using several values of K. Compare the results with those in Part b.

12.17. Repeat Exercise 12.16 using the national track records data for men given in Table 8.6.
Compare the results with those of Exercise 12.16. Explain any differences.

12.18. Table 12.12 gives the road distances between 12 Wisconsin cities and cities in neighboring
states. Locate the cities in q = 1, 2, and 3 dimensions using multidimensional scaling. Plot
the minimum stress(q) versus q and interpret the graph. Compare the two-dimensional
multidimensional scaling configuration with the locations of the cities on a map from an
atlas.

12.19. Table 12.13 on page 752 gives the “distances” between certain archaeological sites
from different periods, based upon the frequencies of different types of potsherds found
at the sites. Given these distances, determine the coordinates of the sites in q = 3, 4,
and 5 dimensions using multidimensional scaling. Plot the minimum stress(q) versus q


Table 12.12 Distances Between Cities in Wisconsin and Cities in Neighboring States

                      (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)  (10)  (11)  (12)
Appleton (1)            0
Beloit (2)            130     0
Fort Atkinson (3)      98    33     0
Madison (4)           102    50    36     0
Marshfield (5)        103   185   164   138     0
Milwaukee (6)         100    73    54    77   184     0
Monroe (7)            149    33    58    47   170   107     0
Superior (8)          315   377   359   330   219   394   362     0
Wausau (9)             91   186   166   139    45   181   186   223     0
Dubuque (10)          196    94   119    95   186   168    61   351   215     0
St. Paul (11)         257   304   287   258   161   322   289   162   175   274     0
Chicago (12)          186    97   113   146   276    93   130   467   275   184   395     0


Table 12.13 Distances Between Archaeological Sites

       P1980918  P1931131  P1550960  P1530987  P1361024  P1351005  P1340945  P1311137  P1301062
          (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)       (9)
(1)        0
(2)        ?         0
(3)      1.004       ?         0
(4)        ?         ?       0.233       0
(5)        ?       1.870     0.719     0.541       0
(6)      0.914     2.070     0.719       ?       0.539       0
(7)      0.914     2.186     0.452     0.681       ?       0.916       0
(8)      2.056       ?       1.986     1.990     1.963     2.056       ?         0
(9)      1.608     1.722     1.358       ?       0.681     1.005     1.719     1.991       0

(? marks an entry that is not legible in this copy.)

KEY: P1980918 refers to site P198 dated A.D. 0918, P1931131 refers to site P193 dated A.D. 1131, and so forth.
Source: Data Courtesy of M. J. Tretter.



and interpret the graph. If possible, locate the sites in two dimensions (the first two
principal components) using the coordinates for the q = 5-dimensional solution. (Treat
the sites as variables.) Noting the periods associated with the sites, interpret the
two-dimensional configuration.


12.20. A sample of n = 1660 people is cross-classified according to mental health status and 
socioeconomic status in Table 12.14. 
Perform a correspondence analysis of these data. Interpret the results. Can the asso- 
ciations in the data be well represented in one dimension? 


12.21. A sample of 901 individuals was cross-classified according to three categories of income
and four categories of job satisfaction. The results are given in Table 12.15.
Perform a correspondence analysis of these data. Interpret the results.


12.22. Perform a correspondence analysis of the data on forests listed in Table 12.10, and verify 
Figure 12.28 given in Example 12.22. 


12.23. Construct a biplot of the pottery data in Table 12.8. Interpret the biplot. Is the biplot con- 
sistent with the correspondence analysis plot in Figure 12.22? Discuss your answer. (Use 
the row proportions as a vector of observations at a site.) 


12.24. Construct a biplot of the mental health and socioeconomic data in Table 12.14. Interpret 
the biplot. Is the biplot consistent with the correspondence analysis plot in Exercise 
12.20? Discuss your answer. (Use the column proportions as the vector of observations 
for each status.) 


Table 12.14 Mental Health Status and Socioeconomic Status Data

                                       Parental Socioeconomic Status
Mental Health Status             A (High)     B     C     D     E (Low)
Well
Mild symptom formation
Moderate symptom formation
Impaired

Source: Adapted from data in Srole, L., T. S. Langner, S. T. Michael, P. Kirkpatrick, M. K. Opler, and
T. A. C. Rennie, Mental Health in the Metropolis: The Midtown Manhattan Study, rev. ed. (New York: NYU
Press, 1978).


Table 12.15 Income and Job Satisfaction Data

                                        Job Satisfaction
                      Very            Somewhat        Moderately     Very
Income                dissatisfied    dissatisfied    satisfied      satisfied
< $25,000                  42              62             184            207
$25,000-$50,000            13              28              81            113
> $50,000                   7              18

Source: Adapted from data in Table 8.2 in Agresti, A., Categorical Data Analysis (New York: John
Wiley, 1990).


12.25. Using the archaeological data in Table 12.13, determine the two-dimensional metric and
nonmetric multidimensional scaling plots. (See Exercise 12.19.) Given the coordinates of
the points in each of these plots, perform a Procrustes analysis. Interpret the results.

12.26. Table 8.7 contains the Mali family farm data (see Exercise 8.28). Remove the outliers 25,
34, 69, and 72, leaving a total of n = 72 observations in the data set. Treating the
Euclidean distances between pairs of farms as a measure of similarity, cluster the farms
using average linkage and Ward’s method. Construct the dendrograms and compare the
results. Do there appear to be several distinct clusters of farms?

12.27. Repeat Exercise 12.26 using standardized observations. Does it make a difference
whether standardized or unstandardized observations are used? Explain.

12.28. Using the Mali family farm data in Table 8.7 with the outliers 25, 34, 69, and 72 removed,
cluster the farms with the K-means clustering algorithm for K = 5 and K = 6.
Compare the results with those in Exercise 12.26. Is 5 or 6 about the right number of
distinct clusters? Discuss.

12.29. Repeat Exercise 12.28 using standardized observations. Does it make a difference
whether standardized or unstandardized observations are used? Explain.

12.30. A company wants to do a mail marketing campaign. It costs the company $1 for each
item mailed. They have information on 100,000 customers. Create and interpret a
cumulative lift chart from the following information.

Overall Response Rate: Assume we have no model other than the prediction of the
overall response rate, which is 20%. That is, if all 100,000 customers are contacted
(at a cost of $100,000), we will receive around 20,000 positive responses.

Results of Response Model: A response model predicts who will respond to a
marketing campaign. We use the response model to assign a score to all 100,000
customers and predict the positive responses from contacting only the top
10,000 customers, the top 20,000 customers, and so forth. The model predictions
are summarized below.

     Cost     Total Customers    Positive
     ($)      Contacted          Responses
    10000          10000            6000
    20000          20000           10000
    30000          30000           13000
    40000          40000           15800
    50000          50000           17000
    60000          60000           18000
    70000          70000           18800
    80000          80000           19400
    90000          90000           19800
   100000         100000           20000

12.31. Consider the crude-oil data in Table 11.7. Transform the data as in Example 11.14. Ignore
the known group membership. Using the special purpose software MCLUST,
(a) select a mixture model using the BIC criterion allowing for the different covariance
structures listed in Section 12.5 and up to K = 7 groups.
(b) compare the clustering results for the best model with the known classifications
given in Example 11.14. Notice how several clusters correspond to one crude-oil
classification.


References 755 


References 


> WwW 


21. 


22. 


23. 


. Abramowitz, M., and I. A. Stegun, eds. Handbook of Mathematical Functions. US. 


Department of Commerce, National Bureau of Standards Applied Mathematical Series. 
55, 1964. 


. Adriaans, P.,and D. Zantinge. Data Mining. Harlow, England: Addison-Wesley, 1996. 
. Anderberg, M. R. Cluster Analysis for Applications. New York: Academic Press, 1973. 
. Berry, M. J. A., and G, Linoff. Data Mining Techniques: For Marketing, Sales and 


Customer Relationship Management (2nd ed.) (paperback). New York: John Wiley, 2004. 


. Berthold, M., and D. J. Hand. Zmtelligent Data Analysis (2nd ed.). Berlin, Germany: 


Springer-Verlag, 2003. 


. Celeux, G., and G. Govaert. “Gaussian Parsimonious Clustering Models.” Pattern 


Recognition, 28 (1995), 781-793. 


7. Cormack, R. M. “A Review of Classification (with discussion).” Journal of the Royal
Statistical Society (A), 134, no. 3 (1971), 321-367.

8. Everitt, B. S., S. Landau and M. Leese. Cluster Analysis (4th ed.). London: Hodder
Arnold, 2001.

9. Fraley, C., and A. E. Raftery. “Model-Based Clustering, Discriminant Analysis and
Density Estimation.” Journal of the American Statistical Association, 97 (2002), 611-631.

10. Gower, J. C. “Some Distance Properties of Latent Root and Vector Methods Used in
Multivariate Analysis.” Biometrika, 53 (1966), 325-338.

11. Gower, J. C. “Multivariate Analysis and Multidimensional Geometry.” The Statistician,
17 (1967), 13-25.

12. Gower, J. C., and D. J. Hand. Biplots. London: Chapman and Hall, 1996.

13. Greenacre, M. J. “Correspondence Analysis of Square Asymmetric Matrices.” Applied
Statistics, 49 (2000), 297-310.

14. Greenacre, M. J. Theory and Applications of Correspondence Analysis. London:
Academic Press, 1984.

15. Hand, D., H. Mannila, and P. Smyth. Principles of Data Mining. Cambridge, MA: MIT
Press, 2001.

16. Hartigan, J. A. Clustering Algorithms. New York: John Wiley, 1975.

17. Hastie, T. R., R. Tibshirani and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference and Prediction. Berlin, Germany: Springer-Verlag, 2001.

18. Kennedy, R. L., L. Lee, B. Van Roy, C. D. Reed, and R. P. Lippmann. Solving Data Mining
Problems Through Pattern Recognition. Upper Saddle River, NJ: Prentice-Hall, 1997.

19. Kruskal, J. B. “Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric
Hypothesis.” Psychometrika, 29, no. 1 (1964), 1-27.

20. Kruskal, J. B. “Non-metric Multidimensional Scaling: A Numerical Method.”
Psychometrika, 29, no. 2 (1964), 115-129.

21. Kruskal, J. B., and M. Wish. “Multidimensional Scaling.” Sage University Paper Series on
Quantitative Applications in the Social Sciences, 07-011. Beverly Hills and London:
Sage Publications, 1978.

22. LaPointe, F-J, and P. Legendre. “A Classification of Pure Malt Scotch Whiskies.” Applied
Statistics, 43, no. 1 (1994), 237-257.

23. le Roux, N. J., and S. Gardner. “Analysing Your Multivariate Data as a Pictorial: A Case
for Applying Biplot Methodology.” International Statistical Review, 73 (2005), 365-387.




24. Ludwig, J. A., and J. F. Reynolds. Statistical Ecology—a Primer on Methods and
Computing. New York: Wiley-Interscience, 1988.

25. MacQueen, J. B. “Some Methods for Classification and Analysis of Multivariate
Observations.” Proceedings of 5th Berkeley Symposium on Mathematical Statistics and
Probability, 1, Berkeley, CA: University of California Press (1967), 281-297.

26. Mardia, K. V., J. T. Kent, and J. M. Bibby. Multivariate Analysis (Paperback). London:
Academic Press, 2003.

27. Morgan, B. J. T., and A. P. G. Ray. “Non-uniqueness and Inversions in Cluster Analysis.”
Applied Statistics, 44, no. 1 (1995), 117-134.

28. Pyle, D. Data Preparation for Data Mining. San Francisco: Morgan Kaufmann, 1999.

29. Shepard, R. N. “Multidimensional Scaling, Tree-Fitting, and Clustering.” Science, 210,
no. 4468 (1980), 390-398.

30. Sibson, R. “Studies in the Robustness of Multidimensional Scaling.” Journal of the Royal
Statistical Society (B), 40 (1978), 234-238.

31. Takane, Y., F. W. Young, and J. De Leeuw. “Non-metric Individual Differences
Multidimensional Scaling: Alternating Least Squares with Optimal Scaling Features.”
Psychometrika, 42 (1977), 7-67.

32. Ward, Jr., J. H. “Hierarchical Grouping to Optimize an Objective Function.” Journal of
the American Statistical Association, 58 (1963), 236-244.

33. Westphal, C., and T. Blaxton. Data Mining Solutions: Methods and Tools for Solving Real
World Problems (Paperback). New York: John Wiley, 1998.

34. Witten, I. H., and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques (2nd ed.) (Paperback). San Francisco: Morgan Kaufmann, 2005.

35. Young, F. W., and R. M. Hamer. Multidimensional Scaling: History, Theory, and
Applications. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers, 1987.


Selected Additional References for Model Based Clustering


Banfield, J. D., and A. E. Raftery. “Model-Based Gaussian and Non-Gaussian
Clustering.” Biometrics, 49 (1993), 803-821.

Biernacki, C., and G. Govaert. “Choosing Models in Model Based Clustering and
Discriminant Analysis.” Journal of Statistical Computation and Simulation, 64 (1999),
49-71.

Celeux, G., and G. Govaert. “A Classification EM Algorithm for Clustering and Two
Stochastic Versions.” Computational Statistics and Data Analysis, 14 (1992), 315-332.

Fraley, C., and A. E. Raftery. “MCLUST: Software for Model Based Cluster Analysis.”
Journal of Classification, 16 (1999), 297-306.

Hastie, T., and R. Tibshirani. “Discriminant Analysis by Gaussian Mixtures.” Journal of
the Royal Statistical Society (B), 58 (1996), 155-176.

McLachlan, G. J., and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. New York: Marcel Dekker, 1988.

Schwarz, G. “Estimating the Dimension of a Model.” Annals of Statistics, 6 (1978),
461-464.


Appendix


Table 1 Standard Normal Probabilities
Table 2 Student’s t-Distribution Percentage Points
Table 3 χ² Distribution Percentage Points
Table 4 F-Distribution Percentage Points (α = .10)
Table 5 F-Distribution Percentage Points (α = .05)
Table 6 F-Distribution Percentage Points (α = .01)




TABLE 1 STANDARD NORMAL PROBABILITIES

P[Z ≤ z]

  z |   .00    .01    .02    .03    .04    .05    .06    .07    .08    .09

 .0 | .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
 .1 | .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
 .2 | .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
 .3 | .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
 .4 | .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
 .5 | .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
 .6 | .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
 .7 | .7580  .7611  .7642  .7673  .7703  .7734  .7764  .7794  .7823  .7852
 .8 | .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
 .9 | .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389

1.0 | .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
1.1 | .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
1.2 | .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
1.3 | .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
1.4 | .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
1.5 | .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
1.6 | .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
1.7 | .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
1.8 | .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
1.9 | .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767

2.0 | .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
2.1 | .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
2.2 | .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
2.3 | .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
2.4 | .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
2.5 | .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
2.6 | .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
2.7 | .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
2.8 | .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
2.9 | .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986

3.0 | .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
3.1 | .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
3.2 | .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
3.3 | .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
3.4 | .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998
3.5 | .9998  .9998  .9998  .9998  .9998  .9998  .9998  .9998  .9998  .9998
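
Entries of the standard normal table can also be computed directly rather than looked up. The following is a minimal sketch, not part of the original appendix: it assumes Python's standard `math.erf` and uses the identity P[Z ≤ z] = ½[1 + erf(z/√2)]. The function names `standard_normal_cdf` and `table_entry` are illustrative choices.

```python
import math


def standard_normal_cdf(z: float) -> float:
    """Return P[Z <= z] for a standard normal Z, using the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def table_entry(row: float, col: float) -> float:
    """Entry for row label `row` (e.g. 1.9) and column label `col` (e.g. .06),
    rounded to four decimal places as in the printed table."""
    return round(standard_normal_cdf(row + col), 4)


if __name__ == "__main__":
    # Print a few entries so they can be compared with the table by eye.
    for row, col in [(0.0, 0.00), (1.0, 0.00), (1.9, 0.06), (2.5, 0.08)]:
        z = row + col
        print(f"z = {z:.2f}   P[Z <= z] = {table_entry(row, col):.4f}")
```

A value for negative z follows from symmetry, P[Z ≤ −z] = 1 − P[Z ≤ z], which the function above handles without any special casing.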




TABLE 2 STUDENT'S t-DISTRIBUTION PERCENTAGE POINTS

t_ν(α)

                                          α
d.f.
  ν      .250     .100     .050     .025     .010   .00833   .00625     .005    .0025

  1     1.000    3.078    6.314   12.706   31.821   38.190   50.923   63.657  127.321
  2      .816    1.886    2.920    4.303    6.965    7.649    8.860    9.925   14.089
  3      .765    1.638    2.353    3.182    4.541    4.857    5.392    5.841    7.453
  4      .741    1.533    2.132    2.776    3.747    3.961    4.315    4.604    5.598
  5      .727    1.476    2.015    2.571    3.365    3.534    3.810    4.032    4.773
  6      .718    1.440    1.943    2.447    3.143    3.287    3.521    3.707    4.317
  7      .711    1.415    1.895    2.365    2.998    3.128    3.335    3.499    4.029
  8      .706    1.397    1.860    2.306    2.896    3.016    3.206    3.355    3.833
  9      .703    1.383    1.833    2.262    2.821    2.933    3.111    3.250    3.690
 10      .700    1.372    1.812    2.228    2.764    2.870    3.038    3.169    3.581
 11      .697    1.363    1.796    2.201    2.718    2.820    2.981    3.106    3.497
 12      .695    1.356    1.782    2.179    2.681    2.779    2.934    3.055    3.428
 13      .694    1.350    1.771    2.160    2.650    2.746    2.896    3.012    3.372
 14      .692    1.345    1.761    2.145    2.624    2.718    2.864    2.977    3.326
 15      .691    1.341    1.753    2.131    2.602    2.694    2.837    2.947    3.286
 16      .690    1.337    1.746    2.120    2.583    2.673    2.813    2.921    3.252
 17      .689    1.333    1.740    2.110    2.567    2.655    2.793    2.898    3.222
 18      .688    1.330    1.734    2.101    2.552    2.639    2.775    2.878    3.197
 19      .688    1.328    1.729    2.093    2.539    2.625    2.759    2.861    3.174
 20      .687    1.325    1.725    2.086    2.528    2.613    2.744    2.845    3.153
 21      .686    1.323    1.721    2.080    2.518    2.601    2.732    2.831    3.135
 22      .686    1.321    1.717    2.074    2.508    2.591    2.720    2.819    3.119
 23      .685    1.319    1.714    2.069    2.500    2.582    2.710    2.807    3.104
 24      .685    1.318    1.711    2.064    2.492    2.574    2.700    2.797    3.091
 25      .684    1.316    1.708    2.060    2.485    2.566    2.692    2.787    3.078
 26      .684    1.315    1.706    2.056    2.479    2.559    2.684    2.779    3.067
 27      .684    1.314    1.703    2.052    2.473    2.552    2.676    2.771    3.057
 28      .683    1.313    1.701    2.048    2.467    2.546    2.669    2.763    3.047
 29      .683    1.311    1.699    2.045    2.462    2.541    2.663    2.756    3.038
 30      .683    1.310    1.697    2.042    2.457    2.536    2.657    2.750    3.030
 40      .681    1.303    1.684    2.021    2.423    2.499    2.616    2.704    2.971
 60      .679    1.296    1.671    2.000    2.390    2.463    2.575    2.660    2.915
120      .677    1.289    1.658    1.980    2.358    2.428    2.536    2.617    2.860
  ∞      .674    1.282    1.645    1.960    2.326    2.394    2.498    2.576    2.813




TABLE 3 χ² DISTRIBUTION PERCENTAGE POINTS

χ²_ν(α)

                                  α
d.f.
  ν      .950     .900     .100     .050     .025     .010     .005

  1       .00      .02     2.71     3.84     5.02     6.63     7.88
  2       .10      .21     4.61     5.99     7.38     9.21    10.60
  3       .35      .58     6.25     7.81     9.35    11.34    12.84
  4       .71     1.06     7.78     9.49    11.14    13.28    14.86
  5      1.15     1.61     9.24    11.07    12.83    15.09    16.75
  6      1.64     2.20    10.64    12.59    14.45    16.81    18.55
  7      2.17     2.83    12.02    14.07    16.01    18.48    20.28
  8      2.73     3.49    13.36    15.51    17.53    20.09    21.95
  9      3.33     4.17    14.68    16.92    19.02    21.67    23.59
 10      3.94     4.87    15.99    18.31    20.48    23.21    25.19
 11      4.57     5.58    17.28    19.68    21.92    24.72    26.76
 12      5.23     6.30    18.55    21.03    23.34    26.22    28.30
 13      5.89     7.04    19.81    22.36    24.74    27.69    29.82
 14      6.57     7.79    21.06    23.68    26.12    29.14    31.32
 15      7.26     8.55    22.31    25.00    27.49    30.58    32.80
 16      7.96     9.31    23.54    26.30    28.85    32.00    34.27
 17      8.67    10.09    24.77    27.59    30.19    33.41    35.72
 18      9.39    10.86    25.99    28.87    31.53    34.81    37.16
 19     10.12    11.65    27.20    30.14    32.85    36.19    38.58
 20     10.85    12.44    28.41    31.41    34.17    37.57    40.00
 21     11.59    13.24    29.62    32.67    35.48    38.93    41.40
 22     12.34    14.04    30.81    33.92    36.78    40.29    42.80
 23     13.09    14.85    32.01    35.17    38.08    41.64    44.18
 24     13.85    15.66    33.20    36.42    39.36    42.98    45.56
 25     14.61    16.47    34.38    37.65    40.65    44.31    46.93
 26     15.38    17.29    35.56    38.89    41.92    45.64    48.29
 27     16.15    18.11    36.74    40.11    43.19    46.96    49.64
 28     16.93    18.94    37.92    41.34    44.46    48.28    50.99
 29     17.71    19.77    39.09    42.56    45.72    49.59    52.34
 30     18.49    20.60    40.26    43.77    46.98    50.89    53.67
 40     26.51    29.05    51.81    55.76    59.34    63.69    66.77
 50     34.76    37.69    63.17    67.50    71.42    76.15    79.49
 60     43.19    46.46    74.40    79.08    83.30    88.38    91.95
 70     51.74    55.33    85.53    90.53    95.02   100.43   104.21
 80     60.39    64.28    96.58   101.88   106.63   112.33   116.32
 90     69.13    73.29   107.57   113.15   118.14   124.12   128.30
100     77.93    82.36   118.50   124.34   129.56   135.81   140.17


TABLE 4 F-DISTRIBUTION PERCENTAGE POINTS (α = .10)

F_{ν1,ν2}(.10)

ν2\ν1      1      2      3      4      5      6      7      8      9     10     12     15     25     30     40     60

    1  39.86  49.50  53.59  55.83  57.24  58.20  58.91  59.44  59.86  60.19  60.71  61.22  62.05  62.26  62.53  62.79
    2   8.53   9.00   9.16   9.24   9.29   9.33   9.35   9.37   9.38   9.39   9.41   9.42   9.45   9.46   9.47   9.47
    3   5.54   5.46   5.39   5.34   5.31   5.28   5.27   5.25   5.24   5.23   5.22   5.20   5.17   5.17   5.16   5.15
    4   4.54   4.32   4.19   4.11   4.05   4.01   3.98   3.95   3.94   3.92   3.90   3.87   3.83   3.82   3.80   3.79
    5   4.06   3.78   3.62   3.52   3.45   3.40   3.37   3.34   3.32   3.30   3.27   3.24   3.19   3.17   3.16   3.14
    6   3.78   3.46   3.29   3.18   3.11   3.05   3.01   2.98   2.96   2.94   2.90   2.87   2.81   2.80   2.78   2.76
    7   3.59   3.26   3.07   2.96   2.88   2.83   2.78   2.75   2.72   2.70   2.67   2.63   2.57   2.56   2.54   2.51
    8   3.46   3.11   2.92   2.81   2.73   2.67   2.62   2.59   2.56   2.54   2.50   2.46   2.40   2.38   2.36   2.34
    9   3.36   3.01   2.81   2.69   2.61   2.55   2.51   2.47   2.44   2.42   2.38   2.34   2.27   2.25   2.23   2.21
   10   3.29   2.92   2.73   2.61   2.52   2.46   2.41   2.38   2.35   2.32   2.28   2.24   2.17   2.16   2.13   2.11
   11   3.23   2.86   2.66   2.54   2.45   2.39   2.34   2.30   2.27   2.25   2.21   2.17   2.10   2.08   2.05   2.03
   12   3.18   2.81   2.61   2.48   2.39   2.33   2.28   2.24   2.21   2.19   2.15   2.10   2.03   2.01   1.99   1.96
   13   3.14   2.76   2.56   2.43   2.35   2.28   2.23   2.20   2.16   2.14   2.10   2.05   1.98   1.96   1.93   1.90
   14   3.10   2.73   2.52   2.39   2.31   2.24   2.19   2.15   2.12   2.10   2.05   2.01   1.93   1.91   1.89   1.86
   15   3.07   2.70   2.49   2.36   2.27   2.21   2.16   2.12   2.09   2.06   2.02   1.97   1.89   1.87   1.85   1.82
   16   3.05   2.67   2.46   2.33   2.24   2.18   2.13   2.09   2.06   2.03   1.99   1.94   1.86   1.84   1.81   1.78
   17   3.03   2.64   2.44   2.31   2.22   2.15   2.10   2.06   2.03   2.00   1.96   1.91   1.83   1.81   1.78   1.75
   18   3.01   2.62   2.42   2.29   2.20   2.13   2.08   2.04   2.00   1.98   1.93   1.89   1.80   1.78   1.75   1.72
   19   2.99   2.61   2.40   2.27   2.18   2.11   2.06   2.02   1.98   1.96   1.91   1.86   1.78   1.76   1.73   1.70
   20   2.97   2.59   2.38   2.25   2.16   2.09   2.04   2.00   1.96   1.94   1.89   1.84   1.76   1.74   1.71   1.68
   21   2.96   2.57   2.36   2.23   2.14   2.08   2.02   1.98   1.95   1.92   1.87   1.83   1.74   1.72   1.69   1.66
   22   2.95   2.56   2.35   2.22   2.13   2.06   2.01   1.97   1.93   1.90   1.86   1.81   1.73   1.70   1.67   1.64
   23   2.94   2.55   2.34   2.21   2.11   2.05   1.99   1.95   1.92   1.89   1.84   1.80   1.71   1.69   1.66   1.62
   24   2.93   2.54   2.33   2.19   2.10   2.04   1.98   1.94   1.91   1.88   1.83   1.78   1.70   1.67   1.64   1.61
   25   2.92   2.53   2.32   2.18   2.09   2.02   1.97   1.93   1.89   1.87   1.82   1.77   1.68   1.66   1.63   1.59
   26   2.91   2.52   2.31   2.17   2.08   2.01   1.96   1.92   1.88   1.86   1.81   1.76   1.67   1.65   1.61   1.58
   27   2.90   2.51   2.30   2.17   2.07   2.00   1.95   1.91   1.87   1.85   1.80   1.75   1.66   1.64   1.60   1.57
   28   2.89   2.50   2.29   2.16   2.06   2.00   1.94   1.90   1.87   1.84   1.79   1.74   1.65   1.63   1.59   1.56
   29   2.89   2.50   2.28   2.15   2.06   1.99   1.93   1.89   1.86   1.83   1.78   1.73   1.64   1.62   1.58   1.55
   30   2.88   2.49   2.28   2.14   2.05   1.98   1.93   1.88   1.85   1.82   1.77   1.72   1.63   1.61   1.57   1.54
   40   2.84   2.44   2.23   2.09   2.00   1.93   1.87   1.83   1.79   1.76   1.71   1.66   1.57   1.54   1.51   1.47
   60   2.79   2.39   2.18   2.04   1.95   1.87   1.82   1.77   1.74   1.71   1.66   1.60   1.50   1.48   1.44   1.40
  120   2.75   2.35   2.13   1.99   1.90   1.82   1.77   1.72   1.68   1.65   1.60   1.55   1.45   1.41   1.37   1.32
    ∞   2.71   2.30   2.08   1.94   1.85   1.77   1.72   1.67   1.63   1.60   1.55   1.49   1.38   1.34   1.30   1.24




TABLE 5 F-DISTRIBUTION PERCENTAGE POINTS (α = .05)

F_{ν1,ν2}(.05)

ν2\ν1      1      2      3      4      5      6      7      8      9     10     12     15     20     25     30     40     60

    1  161.5  199.5  215.7  224.6  230.2  234.0  236.8  238.9  240.5  241.9  243.9  246.0  248.0  249.3  250.1  251.1  252.2
    2  18.51  19.00  19.16  19.25  19.30  19.33  19.35  19.37  19.38  19.40  19.41  19.43  19.45  19.46  19.46  19.47  19.48
    3  10.13   9.55   9.28   9.12   9.01   8.94   8.89   8.85   8.81   8.79   8.74   8.70   8.66   8.63   8.62   8.59   8.57
    4   7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   6.00   5.96   5.91   5.86   5.80   5.77   5.75   5.72   5.69
    5   6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.77   4.74   4.68   4.62   4.56   4.52   4.50   4.46   4.43
    6   5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.10   4.06   4.00   3.94   3.87   3.83   3.81   3.77   3.74
    7   5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.68   3.64   3.57   3.51   3.44   3.40   3.38   3.34   3.30
    8   5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.39   3.35   3.28   3.22   3.15   3.11   3.08   3.04   3.01
    9   5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.18   3.14   3.07   3.01   2.94   2.89   2.86   2.83   2.79
   10   4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02   2.98   2.91   2.85   2.77   2.73   2.70   2.66   2.62
   11   4.84   3.98   3.59   3.36   3.20   3.09   3.01   2.95   2.90   2.85   2.79   2.72   2.65   2.60   2.57   2.53   2.49
   12   4.75   3.89   3.49   3.26   3.11   3.00   2.91   2.85   2.80   2.75   2.69   2.62   2.54   2.50   2.47   2.43   2.38
   13   4.67   3.81   3.41   3.18   3.03   2.92   2.83   2.77   2.71   2.67   2.60   2.53   2.46   2.41   2.38   2.34   2.30
   14   4.60   3.74   3.34   3.11   2.96   2.85   2.76   2.70   2.65   2.60   2.53   2.46   2.39   2.34   2.31   2.27   2.22
   15   4.54   3.68   3.29   3.06   2.90   2.79   2.71   2.64   2.59   2.54   2.48   2.40   2.33   2.28   2.25   2.20   2.16
   16   4.49   3.63   3.24   3.01   2.85   2.74   2.66   2.59   2.54   2.49   2.42   2.35   2.28   2.23   2.19   2.15   2.11
   17   4.45   3.59   3.20   2.96   2.81   2.70   2.61   2.55   2.49   2.45   2.38   2.31   2.23   2.18   2.15   2.10   2.06
   18   4.41   3.55   3.16   2.93   2.77   2.66   2.58   2.51   2.46   2.41   2.34   2.27   2.19   2.14   2.11   2.06   2.02
   19   4.38   3.52   3.13   2.90   2.74   2.63   2.54   2.48   2.42   2.38   2.31   2.23   2.16   2.11   2.07   2.03   1.98
   20   4.35   3.49   3.10   2.87   2.71   2.60   2.51   2.45   2.39   2.35   2.28   2.20   2.12   2.07   2.04   1.99   1.95
   21   4.32   3.47   3.07   2.84   2.68   2.57   2.49   2.42   2.37   2.32   2.25   2.18   2.10   2.05   2.01   1.96   1.92
   22   4.30   3.44   3.05   2.82   2.66   2.55   2.46   2.40   2.34   2.30   2.23   2.15   2.07   2.02   1.98   1.94   1.89
   23   4.28   3.42   3.03   2.80   2.64   2.53   2.44   2.37   2.32   2.27   2.20   2.13   2.05   2.00   1.96   1.91   1.86
   24   4.26   3.40   3.01   2.78   2.62   2.51   2.42   2.36   2.30   2.25   2.18   2.11   2.03   1.97   1.94   1.89   1.84
   25   4.24   3.39   2.99   2.76   2.60   2.49   2.40   2.34   2.28   2.24   2.16   2.09   2.01   1.96   1.92   1.87   1.82
   26   4.23   3.37   2.98   2.74   2.59   2.47   2.39   2.32   2.27   2.22   2.15   2.07   1.99   1.94   1.90   1.85   1.80
   27   4.21   3.35   2.96   2.73   2.57   2.46   2.37   2.31   2.25   2.20   2.13   2.06   1.97   1.92   1.88   1.84   1.79
   28   4.20   3.34   2.95   2.71   2.56   2.45   2.36   2.29   2.24   2.19   2.12   2.04   1.96   1.91   1.87   1.82   1.77
   29   4.18   3.33   2.93   2.70   2.55   2.43   2.35   2.28   2.22   2.18   2.10   2.03   1.94   1.89   1.85   1.81   1.75
   30   4.17   3.32   2.92   2.69   2.53   2.42   2.33   2.27   2.21   2.16   2.09   2.01   1.93   1.88   1.84   1.79   1.74
   40   4.08   3.23   2.84   2.61   2.45   2.34   2.25   2.18   2.12   2.08   2.00   1.92   1.84   1.78   1.74   1.69   1.64
   60   4.00   3.15   2.76   2.53   2.37   2.25   2.17   2.10   2.04   1.99   1.92   1.84   1.75   1.69   1.65   1.59   1.53
  120   3.92   3.07   2.68   2.45   2.29   2.18   2.09   2.02   1.96   1.91   1.83   1.75   1.66   1.60   1.55   1.50   1.43
    ∞   3.84   3.00   2.61   2.37   2.21   2.10   2.01   1.94   1.88   1.83   1.75   1.67   1.57   1.51   1.46   1.39   1.32


TABLE 6 F-DISTRIBUTION PERCENTAGE POINTS (α = .01)

F_{ν1,ν2}(.01)

ν2\ν1      1      2      3      4      5      6      7      8      9     10     12     15     20     25     30     40     60

    1  4052.  5000.  5403.  5625.  5764.  5859.  5928.  5981.  6023.  6056.  6106.  6157.  6209.  6240.  6261.  6287.  6313.
    2  98.50  99.00  99.17  99.25  99.30  99.33  99.36  99.37  99.39  99.40  99.42  99.43  99.45  99.46  99.47  99.47  99.48
    3  34.12  30.82  29.46  28.71  28.24  27.91  27.67  27.49  27.35  27.23  27.05  26.87  26.69  26.58  26.50  26.41  26.32
    4  21.20  18.00  16.69  15.98  15.52  15.21  14.98  14.80  14.66  14.55  14.37  14.20  14.02  13.91  13.84  13.75  13.65
    5  16.26  13.27  12.06  11.39  10.97  10.67  10.46  10.29  10.16  10.05   9.89   9.72   9.55   9.45   9.38   9.29   9.20
    6  13.75  10.92   9.78   9.15   8.75   8.47   8.26   8.10   7.98   7.87   7.72   7.56   7.40   7.30   7.23   7.14   7.06
    7  12.25   9.55   8.45   7.85   7.46   7.19   6.99   6.84   6.72   6.62   6.47   6.31   6.16   6.06   5.99   5.91   5.82
    8  11.26   8.65   7.59   7.01   6.63   6.37   6.18   6.03   5.91   5.81   5.67   5.52   5.36   5.26   5.20   5.12   5.03
    9  10.56   8.02   6.99   6.42   6.06   5.80   5.61   5.47   5.35   5.26   5.11   4.96   4.81   4.71   4.65   4.57   4.48
   10  10.04   7.56   6.55   5.99   5.64   5.39   5.20   5.06   4.94   4.85   4.71   4.56   4.41   4.31   4.25   4.17   4.08
   11   9.65   7.21   6.22   5.67   5.32   5.07   4.89   4.74   4.63   4.54   4.40   4.25   4.10   4.01   3.94   3.86   3.78
   12   9.33   6.93   5.95   5.41   5.06   4.82   4.64   4.50   4.39   4.30   4.16   4.01   3.86   3.76   3.70   3.62   3.54
   13   9.07   6.70   5.74   5.21   4.86   4.62   4.44   4.30   4.19   4.10   3.96   3.82   3.66   3.57   3.51   3.43   3.34
   14   8.86   6.51   5.56   5.04   4.69   4.46   4.28   4.14   4.03   3.94   3.80   3.66   3.51   3.41   3.35   3.27   3.18
   15   8.68   6.36   5.42   4.89   4.56   4.32   4.14   4.00   3.89   3.80   3.67   3.52   3.37   3.28   3.21   3.13   3.05
   16   8.53   6.23   5.29   4.77   4.44   4.20   4.03   3.89   3.78   3.69   3.55   3.41   3.26   3.16   3.10   3.02   2.93
   17   8.40   6.11   5.19   4.67   4.34   4.10   3.93   3.79   3.68   3.59   3.46   3.31   3.16   3.07   3.00   2.92   2.83
   18   8.29   6.01   5.09   4.58   4.25   4.01   3.84   3.71   3.60   3.51   3.37   3.23   3.08   2.98   2.92   2.84   2.75
   19   8.18   5.93   5.01   4.50   4.17   3.94   3.77   3.63   3.52   3.43   3.30   3.15   3.00   2.91   2.84   2.76   2.67
   20   8.10   5.85   4.94   4.43   4.10   3.87   3.70   3.56   3.46   3.37   3.23   3.09   2.94   2.84   2.78   2.69   2.61
   21   8.02   5.78   4.87   4.37   4.04   3.81   3.64   3.51   3.40   3.31   3.17   3.03   2.88   2.79   2.72   2.64   2.55
   22   7.95   5.72   4.82   4.31   3.99   3.76   3.59   3.45   3.35   3.26   3.12   2.98   2.83   2.73   2.67   2.58   2.50
   23   7.88   5.66   4.76   4.26   3.94   3.71   3.54   3.41   3.30   3.21   3.07   2.93   2.78   2.69   2.62   2.54   2.45
   24   7.82   5.61   4.72   4.22   3.90   3.67   3.50   3.36   3.26   3.17   3.03   2.89   2.74   2.64   2.58   2.49   2.40
   25   7.77   5.57   4.68   4.18   3.85   3.63   3.46   3.32   3.22   3.13   2.99   2.85   2.70   2.60   2.54   2.45   2.36
   26   7.72   5.53   4.64   4.14   3.82   3.59   3.42   3.29   3.18   3.09   2.96   2.81   2.66   2.57   2.50   2.42   2.33
   27   7.68   5.49   4.60   4.11   3.78   3.56   3.39   3.26   3.15   3.06   2.93   2.78   2.63   2.54   2.47   2.38   2.29
   28   7.64   5.45   4.57   4.07   3.75   3.53   3.36   3.23   3.12   3.03   2.90   2.75   2.60   2.51   2.44   2.35   2.26
   29   7.60   5.42   4.54   4.04   3.73   3.50   3.33   3.20   3.09   3.00   2.87   2.73   2.57   2.48   2.41   2.33   2.23
   30   7.56   5.39   4.51   4.02   3.70   3.47   3.30   3.17   3.07   2.98   2.84   2.70   2.55   2.45   2.39   2.30   2.21
   40   7.31   5.18   4.31   3.83   3.51   3.29   3.12   2.99   2.89   2.80   2.66   2.52   2.37   2.27   2.20   2.11   2.02
   60   7.08   4.98   4.13   3.65   3.34   3.12   2.95   2.82   2.72   2.63   2.50   2.35   2.20   2.10   2.03   1.94   1.84
  120   6.85   4.79   3.95   3.48   3.17   2.96   2.79   2.66   2.56   2.47   2.34   2.19   2.03   1.93   1.86   1.76   1.66
    ∞   6.63   4.61   3.78   3.32   3.02   2.80   2.64   2.51   2.41   2.32   2.18   2.04   1.88   1.78   1.70   1.59   1.47


Data Index 


Admission, 661 
examples, 614, 660 
Airline distances, 710 
example, 709 
Air pollution, 39 
examples, 39, 206, 425, 474, 535 
Amitriptyline, 426 
example, 426 
Anaconda snake, 357 
example, 356 
Archeological site distances, 752 
examples, 750, 754 


Bankruptcy, 657 
examples, 45, 656, 658 
Battery failure, 424 
example, 424 
Biting fly, 352 
example, 350 
Bonds, 346 
example, 345 
Bones (mineral content), 43, 353 
examples, 41, 207, 268, 350, 351, 
425, 476 
Breakfast cereal, 666 
examples, 45, 665, 750 
Bull, 46 


examples, 46, 207, 425, 476, 537, 665 


Calcium (bones), 329, 330 
example, 331 
Carapace (painted turtles), 344, 532 
examples, 343, 356, 445, 454, 532 
Car body assembly, 271 
examples, 270, 480 


Census tract, 474 
examples, 443, 474, 535 
College test scores, 228 
examples, 226, 267, 423 
Computer requirements, 380, 400 
examples, 380, 383, 400, 405, 408, 
410, 412 
Concho water snake, 668 
example, 665 
Crime, 569 
example, 569 
Crude oil, 662 
examples, 347, 356, 625, 
661, 754 


Diabetic, 572 
example, 572 


Effluent, 276 
examples, 276, 337, 338 
Egyptian skull, 349 
examples, 269, 347 
Electrical consumption, 289 
examples, 289, 293, 295, 338, 356
Electrical time-of-use pricing, 350 
example, 349 
Energy consumption, 147 
examples, 147, 270 
Examination scores, 505 
example, 505 


Female bear, 24 
examples, 24, 262 

Forest, 736 
examples, 736, 753 


Fowl, 521 
examples, 520, 532, 552, 559 


Grizzly bear, 262, 478 
examples, 262, 478 


Hair (Peruvian), 263 
example, 263 

Hemophilia, 587, 664, 665 
examples, 587, 591, 663 

Hook-billed kite, 268, 346 
examples, 268, 344 


Iris, 658 
examples, 347, 619, 645, 658, 660, 705 


Job satisfaction/characteristics, 
555, 753 
examples, 553, 563, 565, 753 


Lamentations, 749 
example, 749 

Largest companies, 38 
examples, 38, 183, 205, 206, 423, 471 

Lizards—two genera, 335 
example, 334 

Lizard size, 17 
examples, 17, 18 

Love and marriage, 326 
example, 325 

Lumber, 267 
example, 267 


Mali family farm, 479 
examples, 479, 538, 754 
Mental health, 753 
example, 753 
Mice, 453, 475 
examples, 453, 458, 475, 537 
Milk transportation cost, 269, 345 
examples, 45, 268, 343 
Multiple sclerosis, 42 
examples, 41, 207, 656 
Musical aptitude, 236 
example, 236 


National parks, 47 
examples, 46, 208 




National track records, 44, 477 
examples, 43, 207, 357, 476, 
537, 750 
Natural gas, 414 
example, 413 
Number parity, 342 
example, 342 
Numerals, 679 
examples, 678, 684, 687, 690 
Nursing home, 306-07 
examples, 306, 309, 311 


Olympic decathlon, 499 
examples, 499, 511, 573 
Overtime (police), 240, 478 
examples, 239, 242, 244, 248, 269, 
270, 460, 463, 464, 478 
Oxygen consumption, 348 
examples, 45, 347 


Paper quality, 15 
examples, 14, 20, 207 
Peanut, 354 
example, 353 
Plastic film, 318 
example, 318 
Pottery, 716 
examples, 716, 753 
Profitability, 533 
examples, 533, 571 
Psychological profile, 207 
examples, 207, 478, 537 
Public utility, 688 
examples, 26, 28, 45, 46, 688, 690, 
699, 711, 726 
Pulp and paper properties, 427 
examples, 427, 478, 537, 538, 573 


Radiation, 180, 198 
examples, 180, 197, 206, 221, 226, 
233, 261 
Radiotherapy, 42 
examples, 41, 207, 475 
Reading/arithmetic test scores, 
569 
example, 569 
Real estate, 372 
examples, 372, 423 




Relay tower breakdowns, 358, 428 
examples, 357, 427 

Road distances, 751 
example, 750 


Salmon, 604 
examples, 603, 639, 663, 669 
Sleeping dog, 282 
example, 281 
Smoking, 573 
example, 572 
Snow removal, 148 
examples, 148, 208, 270 
Spectral reflectance, 355 
examples, 354, 355 
Spouse, 351 
example, 350 


Stiffness (lumber), 186, 190 
examples, 186, 190, 342, 
535, 571 
Stock price, 473 
examples, 451, 457, 473, 493, 497,
503, 510, 517, 570, 748
Sweat, 215 
examples, 214, 261, 475 


University, 729 
examples, 713, 729, 731 


Welder, 245 
example, 244 

Wheat, 571 
example, 570 


Subject Index 


Akaike Information Criterion (AIC), 
386, 397, 704 
Analysis of variance, multivariate: 
one-way, 301 
two-way, 315, 340 
Analysis of variance, univariate: 
one-way, 297 
two-way, 312 
ANOVA (see Analysis of variance, 
univariate) 
Autocorrelation, 414 
Autoregressive model, 415 
Average linkage (see Cluster analysis) 
Bayesian Information Criterion 
(BIC), 705 
Biplot, 726, 730
Bonferroni intervals: 
comparison with T² intervals, 234
definition, 232 
for means, 232, 276, 291 
for treatment effects, 309, 317-18 
Box’s M test (see Covariance matrix, 
test for equality of) 


Canonical correlation analysis: 

canonical correlations, 539, 541, 
547, 551 

canonical variables, 539, 
541-42, 551 

correlation coefficients in, 546, 
551-52 

definition of, 541, 550 

errors of approximation, 558 

geometry of, 549 


interpretation of, 545 
population, 541-42 
sample, 550-51 
tests of hypothesis in, 563-64 
variance explained, 561-62 
CART, 644 
Central-limit theorem, 176 
Characteristic equation, 97 
Characteristic roots (see Eigenvalues) 
Characteristic vectors (see 
Eigenvectors) 
Chernoff faces, 27 
Chi-square plots, 184 
Classification: 
Anderson statistic, 592 
Bayes’ rule, 584, 608 
confusion matrix, 598 
error rates, 596, 598, 599 
expected cost, 581, 607 
Lachenbruch holdout procedure, 
599, 619 
linear discriminant functions, 585, 
586, 590, 591, 611, 623 
with logistic regression, 638-39 
misclassification probabilities,
579-80, 583
with normal populations, 584, 
593, 609 
quadratic discriminant function, 
594, 610 
qualitative variables, 644 
selection of variables, 648 
for several groups, 606, 629 
for two groups, 576, 584, 591 
Classification trees, 644 




Cluster analysis:
algorithm, 681, 696
average linkage, 681, 690
complete linkage, 681, 685
dendrogram, 681
hierarchical, 680
inversions in, 695
K-means, 696
similarity and distance, 677
similarity coefficients, 675, 678
single linkage, 681, 682
with statistical models, 703
Ward’s method, 692
Coefficient of determination, 367, 403
Communality, 484
Complete linkage (see Cluster
analysis)
Confidence intervals:
mean of normal population, 211
simultaneous, 225, 232, 235, 265,
276, 309, 317-18
Confidence regions:
for contrasts, 281
definition, 220
for difference of mean vectors,
286, 292
for mean vectors, 221
for paired comparisons, 276
Contingency table, 716
Contrast matrix, 280
Contrast vector, 279
Control chart:
definition, 239
ellipse format, 241, 250, 460
for subsample means, 249, 251
multivariate, 241, 461-62, 465
T² chart, 243, 248, 250, 251, 462
Control regions:
definition, 247
for future observations, 247, 251,
463
Correlation:
autocorrelation, 414
coefficient of, 8, 71
geometrical interpretation of
sample, 119
multiple, 367, 403, 548
partial, 409
sample, 8, 117
Correlation matrix:
population, 72
sample, 9
tests of hypotheses for
equicorrelation, 457-58
Correspondence analysis:
algebraic development, 718
correspondence matrix, 718
inertia, 716, 717, 725
matrix approximation method, 724
profile approximation method, 724
Correspondence matrix, 718
Covariance:
definitions of, 69
of linear combinations, 75, 76
sample, 8
Covariance matrix:
definitions of, 69
distribution of, 175
factor analysis models for, 483
geometrical interpretation of
sample, 119, 124-26
large sample behavior, 175
as matrix operation, 139
partitioning, 73, 78
population, 71
sample, 123
test for equality of, 310


Data mining:
lift chart, 742
model assessment, 742
process, 741
Dendrogram, 681
Descriptive statistics:
correlation coefficient, 8
covariance, 8
mean, 7
variance, 7
Design matrix, 362, 388, 411
Determinant:
computation of, 93
product of eigenvalues, 104
Discriminant function (see
Classification)


Distance: 
Canberra, 674 
Czekanowski, 674 
development of, 30-37, 64 
Euclidean, 30 
Minkowski, 673 
properties, 37 
statistical, 31, 36 
Distributions: 
chi-square (table), 760 
F (table), 761, 762, 763 
multinomial, 264 
normal (table), 758 
Q-Q plot correlation coefficient 
(table), 181 
t (table), 759 
Wishart, 174 


Eigenvalues, 97 
Eigenvectors, 98 
EM algorithm, 252 
Estimation: 
generalized least squares, 422 
least squares, 364 
maximum likelihood, 168 
minimum variance, 369-70 
unbiased, 121, 123, 369-70 
weighted least squares, 420 
Estimator (see Estimation) 
Expected value, 67, 68 
Experimental unit, 5 


Factor analysis: 
bipolar factor, 506 
common factors, 482, 483 
communality, 484 
computational details, 527 
of correlation matrix, 490, 494, 529 
Heywood cases, 497, 529 
least squares (Bartlett) computation 
of factor scores, 514, 515 
loadings, 482, 483 
maximum likelihood estimation 
in, 495 
nonuniqueness of loadings, 487 
oblique rotation, 506, 512 
orthogonal factor model, 483 




principal component estimation
in, 488, 490
principal factor estimation in, 494
regression computation of factor 
scores, 516, 517 
residual matrix, 490 
rotation of factors, 504 
specific factors, 482, 483 
specific variance, 484 
strategy for, 520 
testing for the number of 
factors, 501 
varimax criterion, 507 
Factor loading matrix, 482 
Factor scores, 515, 517 
Fisher’s linear discriminants: 
population, 654 
sample, 590-91, 623 
scaling, 589 


Gamma plot, 184 
Gauss (Markov) theorem, 369 
Generalized inverse, 369, 421 
Generalized least squares (see 
Estimation) 
Generalized variance: 
geometric interpretation of sample, 
124, 135-36 
sample, 123, 135 
situations where zero, 133
General linear model: 
design matrix for, 362, 388 
multivariate, 388 
univariate, 362 
Geometry: 
of classification, 618 
generalized variance, 124, 135-36 
of least squares, 367 
of principal components, 468, 469 
of sample, 119 
Gram-Schmidt process, 86 
Graphical techniques: 
biplot, 726, 730 
Chernoff faces, 27
marginal dot diagrams, 12
n points in p dimensions, 17
p points in n dimensions, 19
scatter diagram (plot), 11, 20
stars, 26

Growth curve, 24, 328 


Hat matrix, 364, 421, 643 
Heywood cases (see Factor analysis) 
Hotelling’s T² (see T²-statistic)


Independence: 
definition, 69 
of multivariate normal variables, 
159-60 
of sample mean and covariance 
matrix, 174 
tests of hypotheses for, 472 
Inequalities: 
Cauchy-Schwarz, 78 
extended Cauchy-Schwarz, 79 
Inertia, 725 
Influential observations, 384, 643 
Invariance of maximum likelihood 
estimators, 172 
Item (individual), 5


K-means (see Cluster analysis)


Lawley-Hotelling trace statistic, 
336, 398 
Leverage, 381, 384 
Lift chart, 742 
Likelihood function, 168 
Likelihood ratio tests: 
definition, 219 
limiting distribution, 220 
in regression, 374, 396 
and T², 218
Linear combination of vectors, 
83, 165 
Linear combination of variables: 
mean of, 76 
normal populations, 156, 157 
sample covariances of, 141, 144 
sample means of, 141, 144 
variance and covariances of, 76 
Logistic classification: 
classification rule, 638-39 


linear discriminant, 639 
Logistic regression: 
deviance, 642 
estimation in, 637-38 
logit, 635 
logistic curve, 636 
model, 637 
residuals, 643 
tests of regression coefficients, 638 


MANOVA (see Analysis of variance, 
multivariate) 
Matrices: 
addition of, 88 
characteristic equation of, 97 
correspondence, 718 
definition of, 54, 87 
determinant of, 93, 104 
dimension of, 88 
eigenvalues of, 59, 97, 98 
eigenvectors of, 59, 98 
generalized inverses of, 364, 
369, 421 
identity, 58, 90 
inverses of, 58, 95 
multiplication of, 56, 90, 109 
orthogonal, 59, 97 
partitioned, 73, 74, 78 
positive definite, 61, 62 
products of, 56, 90, 91 
random, 66 
rank of, 94 
scalar multiplication in, 89 
singular and nonsingular, 95 
singular-value decomposition, 100,
721, 725, 728
spectral decomposition, 61, 100 
square root, 66 
symmetric, 57, 90
trace of, 96 
transpose of, 55, 89 
Maxima and minima (with matrices), 
79, 80 
Maximum likelihood estimation: 
development, 170-72 
invariance property of, 172 
in regression, 370, 395, 404-05 


Mean, 66 

Mean vector: 
definition, 69 
distribution of, 174 
large sample behavior, 175 
as matrix operation, 139 
partitioning, 73, 78 
sample, 9, 78 

Minimal spanning tree, 715 

Missing observations, 251 

Mixture model, 703 

Model based clustering: 
estimation in, 704 
mixture model, 703 
model selection, 704-05 

Model selection criterion: 
AIC, 386, 397, 704 
BIC, 705 

Multicollinearity, 386 

Multidimensional scaling: 
algorithm, 709 
development, 706-15 
SStress, 709
stress, 708 

Multiple comparisons (see
Simultaneous confidence
intervals)

Multiple correlation coefficient: 
population, 403, 548 
sample, 367 

Multiple regression (see Regression
and General linear model)

Multivariate analysis of variance (see
Analysis of variance, multivariate)


Multivariate control chart (see 
Control chart) 

Multivariate normal distribution (see 
Normal distribution, multivariate) 


Neural network, 647 
Nonlinear mapping, 715 
Nonlinear ordination, 738 
Normal distribution: 
bivariate, 151 
checking for normality, 177 
conditional, 160-61 
constant density contours, 153, 435
marginal, 156, 158 
maximum likelihood estimation 
in, 171 
multivariate, 149-55 
properties of, 156-67 
transformations to, 192 
Normal equations, 421 
Normal probability plots (see Q-Q
plots)


Outliers: 
definition, 187 
detection of, 189 


Paired comparisons, 273-79
Partial correlation, 409 
Partitioned matrix: 
definition, 73, 74, 78
determinant of, 202-03 
inverse of, 203 
Pillai’s trace statistic, 336, 398
Plots: 
biplot, 726 
biplot, alternative, 730-31 
C_p, 385
factor scores, 515, 517
gamma (or chi-square), 184 
principal components, 454-55
Q-Q, 178, 382 
residual, 382-83 
scree, 445 
Positive definite (see Quadratic forms) 
Posterior probabilities, 584, 608 
Principal component analysis: 
correlation coefficients in, 433, 
442, 451
for correlation matrix, 437, 451 
definition of, 431-32, 442 
equicorrelation matrix, 440-41 
geometry of, 466-70 
interpretation of, 435-36 
large-sample theory of, 456-69 
monitoring quality with, 459-65 
plots, 454-55 
population, 431-41 
reduction of dimensionality by, 
466-68
sample, 441-53 
tests of hypotheses in, 457-59, 472 
variance explained, 433, 437, 451 
Procrustes analysis:
development, 732-39 
measure of agreement, 733 
rotation, 733 
Profile analysis, 323-28 
Proportions: 
large-sample inferences, 264-65 
multinomial distribution, 264 


Q-Q plots:
correlation coefficient, 181 
critical values, 181 
description, 177-82 
Quadratic forms: 
definition, 62, 99 
extrema, 80 
nonnegative definite, 62 
positive definite, 61, 62 


Random matrix, 66 
Random sample, 119-20 
Regression (see also General linear 
model): 
autoregressive model, 415 
assumptions, 361-62, 370, 388, 395 
coefficient of determination, 
367, 403 
confidence regions in, 371, 378, 
399, 421 
C_p plot, 385
decomposition of sum of squares, 
366-67, 389 
extra sum of squares and cross 
products, 374, 396 
fitted values, 364, 389 
forecast errors in, 379 
Gauss theorem in, 369 
geometric interpretation of, 367 
least squares estimates, 364, 393 
likelihood ratio tests in, 374, 396 
maximum likelihood estimation in, 
370-71, 395, 404, 407
multivariate, 387-401
regression coefficients, 364, 406 
regression function, 370, 404 
residual analysis in, 381-83 
residuals, 364, 381, 389 
residual sum of squares and cross 
products, 364, 389 
sampling properties of estimators, 
369-71, 393, 395 
selection of variables, 385-86 
univariate, 360-62 
weighted least squares, 420 
with time-dependent errors, 413-17 
Regression coefficients (see 
Regression) 
Repeated measures designs, 279-83,
328-32
Residuals, 364, 381-83, 389, 455, 643 
Roy’s largest root, 336, 398 


Sample: 
geometry, 119 
Sample splitting, 520, 599, 742 
Scree plot, 445 
Simultaneous confidence ellipses: 
as projections, 258-60 
Simultaneous confidence intervals: 
comparisons of, 229-31, 234, 238 
for components of mean vectors, 
225, 232, 235 
for contrasts, 281 
development, 223-26 
for differences in mean vectors, 
288, 291-92 
for paired comparisons, 276 
as projections, 258 
for regression coefficients, 371 
for treatment effects, 309, 
317-18 
Single linkage (see Cluster analysis) 
Singular matrix, 95 
Singular-value decomposition, 100, 
721, 725, 728 
Special causes (of variation), 239 
Specific variance, 484 
Spectral decomposition, 61, 100 
SStress, 709 


Standard deviation: 
population, 72 
sample, 7
Standard deviation matrix: 
population, 72 
sample, 139 
Standardized observations, 8, 449 
Standardized variables, 436 
Stars, 26 
Strategy for multivariate 
comparisons, 337 
Stress, 708 
Studentized residuals, 381 
Sufficient statistics, 173 
Sum of squares and cross products 
matrices: 
between, 302 
total, 302 
within, 302 


Time dependence (in multivariate 
observations), 256-57, 413-17 
T²-statistic:
definition of, 211-13 
distribution of, 212 
invariance property of, 215-16 
in quality control, 243, 247-48, 250-51,
462
in profile analysis, 324, 325 
for repeated measures designs, 280 
single-sample, 211-12 
two-sample, 286 
two-sample, approximate, 294
Trace of a matrix, 96 
Transformations of data, 192-200 


Variables: 
canonical, 541-42, 550-51 
dummy, 363 
predictor, 361 
response, 361 
standardized, 436 
Variance: 
definition, 68 
generalized, 123, 134 
geometrical interpretation of, 119 
total sample, 137, 442, 451, 561 
Varimax rotation criterion, 507 
Vectors: 
addition, 51, 83 
angle between, 52, 85 
basis, 84 
definition of, 49, 82 
inner product, 52, 53, 85 
length of, 51, 53, 84 
linearly dependent, 53, 83 
linearly independent, 53, 83 
linear span, 83 
perpendicular (orthogonal), 53, 86 
projection of, 54, 86, 87 
random, 66 
scalar multiplication, 50, 82 
unit, 51 
vector space, 83 


Wilks’s lambda, 217, 303, 398 
Wishart distribution, 174 


